Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications

ABSTRACT

A method and system for creating hypercomplex representations of data includes, in one exemplary embodiment, at least one set of training data with associated labels or desired response values, transforming the data and labels into hypercomplex values, methods for defining hypercomplex graphs of functions, training algorithms to minimize the cost of an error function over the parameters in the graph, and methods for reading hierarchical data representations from the resulting graph. Another exemplary embodiment learns hierarchical representations from unlabeled data. The method and system, in another exemplary embodiment, may be employed for biometric identity verification by combining multimodal data collected using many sensors, including, for example, data such as anatomical characteristics, behavioral characteristics, demographic indicators, and artificial characteristics. In other exemplary embodiments, the system and method may learn hypercomplex function approximations in one environment and transfer the learning to other target environments. Other exemplary applications of the hypercomplex deep learning framework include: image segmentation; image quality evaluation; image steganalysis; face recognition; event embedding in natural language processing; machine translation between languages; object recognition; medical applications such as breast cancer mass classification; multispectral imaging; audio processing; color image filtering; and clothing identification.

PRIORITY CLAIM

This application claims priority to U.S. Provisional Application Ser. No. 62/551,901 entitled “Hypercomplex Deep Learning Methods, Architectures, and Apparatus for Multimodal Small, Medium, and Large-Scale Data Representation, Analysis, and Applications” filed Aug. 30, 2018, which is incorporated herein by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with government support under RD-0932339 awarded by the National Science Foundation. The government has certain rights in the invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention(s) relate to deep learning networks or graphs, in particular, hypercomplex deep learning systems, methods, and architectures for multimodal small, medium, and large-scale data representation, analysis, and applications.

2. Description of the Relevant Art

Deep learning is a method of discovering hierarchical representations and abstractions of data; the representations help make order out of unstructured data such as images, audio, and text. Existing deep neural networks, however, assume that all data is unstructured, yet many practical engineering problems involve multi-channel data that has important inter-channel relationships which existing deep neural networks cannot model effectively. For example: color images contain three or four related color channels; multispectral images are images of the same object at different wavelengths and therefore have significant inter-wavelength relationships; and multi-sensor applications (e.g. sensor arrays for audio, radar, vision, biometrics, etc.) also employ data relationships between channels. Because existing deep learning structures have difficulty modeling multi-channel data, they require vast amounts of data for training in order to approximate simple inter-channel relationships.

An important topic in traditional, real-valued deep learning literature is the “vanishing gradient” problem, wherein the error terms that are propagated through the network tend towards zero. This occurs due to the use of nonlinearities, such as the hyperbolic tangent, that compress the data around zero, that is, map values of larger magnitude to values of smaller magnitude. After repeated applications of a nonlinearity, the value tends towards zero.
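The compression described above can be illustrated numerically. The following is a minimal sketch, not part of the described method: the function name `repeated_tanh` and the starting value 2.0 are illustrative assumptions, and the loop simply applies the hyperbolic tangent repeatedly to one scalar to show its magnitude shrinking with depth.

```python
import math

def repeated_tanh(x, depth):
    """Apply tanh `depth` times, mimicking a value passing through `depth` layers."""
    for _ in range(depth):
        x = math.tanh(x)
    return x

# The magnitude shrinks as the number of applications grows.
for depth in (1, 10, 100, 1000):
    print(depth, repeated_tanh(2.0, depth))
```

The decay is slow but monotone, which is consistent with the observation that deeper stacks of such nonlinearities drive propagated values towards zero.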

Many application areas have a sparsity of labeled data for deep representation learning. For example, in the fashion industry, there are extensive datasets of images with clothing and labeled attributes such as color, sleeve length, and so on. However, all of these images are taken with models under good lighting. If one wants to identify the same clothing in images on social media, a mapping from the social media images to the well-lit images would be helpful. Unfortunately, this mapping dataset is extremely small and is difficult to expand. Correspondingly, methods for learning data representations on the large, source dataset of model images and transferring that learning to the target task of clothing identification in social media are necessary.

In addition to transferring learning from one domain to another, it is important to transfer learning from human to machine. Traditionally, machine expert systems were created according to the following steps: problem domain experts would engineer a set of data representations called features; these features would be used to train a fully-connected neural network; and, finally, the neural network would be used for prediction and classification tasks. Deep learning has automated the domain expert creation of features, provided that a sufficiently large dataset exists. However, large datasets do not exist for many applications, and therefore a method of employing human expert features to reduce training set size would be helpful.

One important emerging application of representation learning is multi-sensor fusion and transfer learning in biometric identity verification systems. The primary objective of biometric identity verification systems is to create automated methods to uniquely recognize individuals using: (i) anatomical characteristics, for example, DNA, signature, palm prints, fingerprints, finger shape, face, hand geometry, vascular technology, iris, and retina; (ii) behavioral characteristics, for example, voice, gait, typing rhythm, and gestures; (iii) demographic indicators, for example, height, race, age, and gender; and (iv) artificial characteristics such as tattoos. Biometric systems are rapidly being deployed for border security identity verification, prevention of identity theft, and digital device security.

The key drivers of performance in biometric identity verification systems are multimodal data from sensors and information processing techniques to convert that data into useful features. In existing biometric identity verification systems, the data features are compared against a database to confirm or reject the identity of a particular subject. Recent research in biometrics has focused on improving data sensors and creating improved feature representations of the sensor data. By deep learning standards, typical biometric datasets are small: tens to tens of thousands of images is a typical size.

There have been significant recent advances in biometric identity verification systems. However, these methods are still inadequate in challenging identification environments. For example, in multimedia applications such as social media and digital entertainment, one attempts to match Internet face images in social media. However, biometric images in these applications usually exhibit dramatic variations in pose, illumination, and expression, which substantially degrade performance of traditional biometric algorithms. Moreover, multimedia applications present an additional challenge due to the large scale of the image databases available, therefore leading to many users and increased probability of incorrect identification.

Traditional biometric identity verification systems are usually based on two-dimensional biometric images captured in the visible light spectrum; these images are corrupted by different environmental conditions such as varied lighting, camera angles, and resolution. Multispectral imaging, where visible light is combined with other spectra such as near-infrared and thermal, has recently been applied to biometric identity verification systems. This research has demonstrated that multimodal biometric data fusion can significantly improve the accuracy of biometric identity verification systems due to complementary information from multiple modalities. The addition of palmprint data appears to enhance system performance further.

Curated two-dimensional biometric datasets have been created for a variety of biometric tasks. While each dataset contains information for only a narrowly-defined task, the combination of all datasets would enable the creation of a rich knowledge repository. Moreover, combination of existing biometric datasets with the large repository of unlabeled data on the Internet presents new opportunities and challenges. For example, while there is an abundance of unlabeled data available, many problems either do not have a labeled training set or only have a small training set available. Additionally, the creation of large, labeled training sets generally requires significant time and financial resources.

Recent advances in three-dimensional range sensors and sensor processing have made it possible to overcome some limitations, such as distortion due to illumination and pose changes, of two-dimensional biometric identity system modalities. Three-dimensional multispectral images will provide more geometric and shape information than their two-dimensional counterparts. Using this information, new methods of processing depth-based images and three-dimensional surfaces will enable the improvement of biometric identity verification systems.

In light of recent advances, it is desirable:

-   i. to explore data representation and feature learning for applications with multiple data channels;
-   ii. to explore unified processing of multichannel, multimodal data;
-   iii. to resolve the vanishing gradient problem;
-   iv. to combine human and machine expertise to build optimal data representations;
-   v. to explore supervised and unsupervised representation learning;
-   vi. to investigate transferring learning from one or more tasks or databases to a useful target task;
-   vii. to explore the synergies of two-dimensional and three-dimensional modalities at the feature level;
-   viii. to explore methods for mapping three-dimensional images into two-dimensional images to enable the use of existing two-dimensional image databases;
-   ix. to generate hierarchical data representations of textures associated with three-dimensional objects; and,
-   x. to employ two-dimensional and three-dimensional data in the analysis of Internet and multimedia images.

SUMMARY OF THE INVENTION

The present invention(s) include systems, methods, and apparatuses for, or for use in: (i) approximating mathematical relationships between one or more sets of data and for, or use in, creating hypercomplex representations of data; (ii) transferring learned knowledge from one hypercomplex system to another; (iii) multimodal hypercomplex learning; and, (iv) biometric identity verification using multimodal, multi-sensor data and hypercomplex representations.

The Summary introduces key concepts related to the present invention(s). However, the description, figures, and images included herein are not intended to be used as an aid to determine the scope of the claimed subject matter. Moreover, the Summary is not intended to limit the scope of the invention.

In some embodiments, the present invention(s) provide techniques for learning hypercomplex feature representations of data, such as, for example: audio; images; biomedical data; biometrics such as fingerprints, palm prints, iris information, facial images, demographic data, behavioral characteristics, and so on; gene sequences; text; unstructured data; writing; geometric patterns; or any other information source. In some embodiments, the data representations may be employed in the applications of, for example, classification, grading, function approximation, system preprocessing, pretraining of other graph structures, and so on.

Some embodiments of the present invention(s) include, for example, hypercomplex convolutional neural network layers and hypercomplex neural network layers with internal state elements.

In some embodiments, the present invention(s) allow for supervised learning given training data inputs with associated desired responses. The present invention(s) may be arranged in any directed or undirected graph structure, including graphs with feedback or skipped nodes. In some embodiments, the present invention(s) may be combined with other types of graph elements, including, for example, pooling, dropout, and fully-connected neural network layers. In some embodiments, the present invention includes convolutional hypercomplex layers and/or locally-connected layers. Polar (angular) representation of hypercomplex numbers is employed in some embodiments of the invention(s), and other embodiments may quantize or otherwise non-linearly process the angular values of the polar representation.

Some embodiments of the present invention(s) propagate training errors through the network graph using an error-correction learning rule. Some embodiments of the learning rule rely upon multiplication of the error with the hypercomplex inverse of hypercomplex weights in the same or other graph elements which, for example, include neural network layers.

In some embodiments of the present invention(s), hypercomplex mathematical operations are performed using real-valued mathematical operators and additional mathematical steps. Some embodiments of the hypercomplex layers may include real-valued software libraries that have been adapted for use with hypercomplex numbers. Exemplary implementations of the present invention(s) perform highly-optimized computations through real-valued matrix multiply and convolution routines. These real-valued routines, for example, run on readily available computer hardware and are available for download. Examples include the Automatically Tuned Linear Algebra Software (ATLAS) and NVIDIA cuDNN libraries.

In some embodiments, techniques or applications are fully automated and are performed by a computing device, such as, for example, a central processing unit (CPU), graphics processing unit (GPU), field programmable gate array (FPGA), and/or application specific integrated circuit (ASIC).

Some embodiments of the present invention(s) include various graph structures involving hypercomplex operations. Examples include hypercomplex feedforward networks, networks with pooling, recurrent networks (i.e. with feedback), networks where connections skip over one or more layers, networks with state and layers with internal state elements, and any combination of the aforementioned or other elements.

In some embodiments, the present invention(s) may be employed for supervised learning and/or unsupervised learning. Through a novel pretraining technique, embodiments of the present invention(s) may combine knowledge of application-specific features generated by human experts with machine learning of features in hypercomplex deep neural networks. Some embodiments of the invention employ both labeled and unlabeled datasets, where a large labeled dataset in a source domain may assist in classification tasks for an unlabeled, target domain.

Some embodiments of the present invention(s) have applications in broad areas including, for example: image super resolution; image segmentation; image quality evaluation; image steganalysis; face recognition; event embedding in natural language processing; machine translation between languages; object recognition; medical applications such as breast cancer mass classification; multi-sensor data processing; multispectral imaging; image filtering; biometric identity verification; and clothing identification.

Some embodiments of the present invention(s) incorporate multimodal data for biometric identity verification, for example, anatomical characteristics, behavioral characteristics, demographic indicators, and/or artificial characteristics. Some embodiments of the present invention(s) learn biometric features on a source dataset, for example a driver's license face database, and apply recognition in a target domain, such as social media photographs. Some embodiments of the present invention(s) incorporate multispectral imaging and/or palmprint data as additional modalities. Some embodiments of the present invention(s) employ three-dimensional and/or two-dimensional sensor data. Some embodiments of the present invention(s) incorporate unlabeled biometric data to aid in hypercomplex data representation training.

BRIEF DESCRIPTION OF THE DRAWINGS

Advantages of the present invention will become apparent to those skilled in the art with the benefit of the following detailed description of embodiments and upon reference to the accompanying drawings in which:

FIG. 1 depicts a schematic diagram of a hypercomplex convolutional layer;

FIG. 2 depicts a schematic diagram of a hypercomplex convolution layer, with ReLU;

FIG. 3 depicts a flowchart representing hypercomplex convolution;

FIG. 4 depicts a flowchart representing quaternion convolution using sixteen real convolution operations;

FIG. 5 depicts a flowchart representing quaternion convolution using eight real convolution operations;

FIG. 6 depicts a flowchart representing octonion convolution;

FIG. 7 depicts a visualization of a multi-dimensional convolution operation;

FIG. 8 depicts a flowchart representing quaternion angle quantization;

FIG. 9 depicts a schematic diagram of an angle quantization: phi;

FIG. 10 depicts a schematic diagram of an angle quantization: theta;

FIG. 11 depicts a schematic diagram of an angle quantization: psi;

FIG. 12 depicts a flowchart representing an angle quantization example: octonion;

FIG. 13 depicts a schematic diagram of hypercomplex error correction learning: error propagation;

FIG. 14 depicts a schematic diagram of hypercomplex convolutional layer pretraining;

FIG. 15 depicts a schematic diagram of hypercomplex pretraining using expert system features;

FIG. 16 depicts a schematic diagram of hypercomplex transfer learning;

FIG. 17 depicts a schematic diagram of a hypercomplex layer with state: first example;

FIG. 18 depicts a schematic diagram of a hypercomplex layer with state: second example;

FIG. 19 depicts a flowchart representing a hypercomplex feedforward neural network;

FIG. 20 depicts a flowchart representing a hypercomplex convolutional neural network with pooling;

FIG. 21 depicts a flowchart representing a hypercomplex convolutional neural network with feedback: recurrent network;

FIG. 22 depicts a flowchart representing a hypercomplex convolutional neural network with layer jumps;

FIG. 23 depicts a flowchart representing a hypercomplex neural network with multiple parallel filters;

FIG. 24 depicts a flowchart representing a neural network with parallel processing and fusion;

FIG. 25 depicts a flowchart representing a neural network with adaptive layer type selection;

FIG. 26 depicts a flowchart representing an exemplary application: enlarging images;

FIG. 27 depicts a representation of super resolution image 1: ground truth;

FIG. 28 depicts a representation of super resolution image 1: real-valued prediction;

FIG. 29 depicts a representation of super resolution image 1: hypercomplex prediction;

FIG. 30 depicts a representation of super resolution image 2: ground truth;

FIG. 31 depicts a representation of super resolution image 2: real-valued prediction;

FIG. 32 depicts a representation of super resolution image 2: hypercomplex prediction;

FIG. 33 depicts a flowchart representing a hypercomplex application: image segmentation;

FIG. 34 depicts a flowchart representing an image quality measurement;

FIG. 35 depicts a flowchart representing a hypercomplex application: image steganalysis;

FIG. 36 depicts a flowchart representing a hypercomplex application: face recognition;

FIG. 37 depicts a flowchart representing a hypercomplex application: event embedding;

FIG. 38 depicts a flowchart representing a hypercomplex application: machine translation 1;

FIG. 39 depicts a flowchart representing a hypercomplex application: machine translation 2;

FIG. 40 depicts a flowchart representing a hypercomplex application: unsupervised learning;

FIG. 41 depicts a flowchart representing a hypercomplex application, control: learning actions in a system with state;

FIG. 42 depicts a flowchart representing a hypercomplex application: generative models;

FIG. 43 depicts a flowchart representing a hypercomplex application: breast cancer classification or grading with hypercomplex network;

FIG. 44 depicts a flowchart representing a hypercomplex application: breast cancer classification or grading with pre-trained expert features;

FIG. 45 depicts a flowchart representing a hypercomplex application: breast cancer classification or grading with pre-trained expert features and postprocessing;

FIG. 46 depicts a flowchart representing a hypercomplex application: magnetic resonance image processing to classify cancer tumors;

FIG. 47 depicts a flowchart representing a hypercomplex application: multi-sensor audio data processing;

FIG. 48 depicts a flowchart representing a hypercomplex application: multispectral image processing;

FIG. 49 depicts a flowchart representing a hypercomplex application: multispectral image processing for fruit firmness and soluble solids content prediction;

FIG. 50 depicts a flowchart representing a hypercomplex application: color image filtering;

FIG. 51 depicts a flowchart representing a hypercomplex application: gray level image processing;

FIG. 52 depicts a flowchart representing a hypercomplex application: enhanced color image processing or classification;

FIG. 53 depicts a flowchart representing a hypercomplex application: multimodal biometric identity verification;

FIG. 54 depicts a flowchart representing a hypercomplex application: multimodal biometric identity verification with autoencoder;

FIG. 55 depicts a flowchart representing a hypercomplex application: multimodal biometric identity verification with transfer learning;

FIG. 56 depicts a flowchart representing a hypercomplex application: multimodal biometric identity verification with missing modalities;

FIG. 57 depicts a flowchart representing a hypercomplex application: attribute classification of clothing using labeled and unlabeled data;

FIG. 58 depicts a flowchart representing a hypercomplex application: transfer learning to match “in the street” clothing with “in store” photographs;

FIG. 59 depicts a flowchart representing hypercomplex training of a neural network; and

FIG. 60 depicts an illustration of Equation 28.

While the invention may be susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. The drawings may not be to scale. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but to the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

It is to be understood that the present invention is not limited to particular devices or methods, which may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. As used in this specification and the appended claims, the singular forms “a”, “an”, and “the” include singular and plural referents unless the content clearly dictates otherwise. Furthermore, the word “may” is used throughout this application in a permissive sense (i.e., having the potential to, being able to), not in a mandatory sense (i.e., must). The term “include,” and derivations thereof, mean “including, but not limited to.” The term “coupled” means directly or indirectly connected.

The Detailed Description will be set forth according to the following outline:

1 Mathematical Definition

-   1.1 Hypercomplex layer introduction
-   1.2 Quaternion convolution
-   1.3 Octonion Convolution
-   1.4 Quaternion convolution for neural networks
-   1.5 Locally-connected layers
-   1.6 Hypercomplex polar form conversion
-   1.7 Activation function angle quantization
-   1.8 Hypercomplex layer learning rule
-   1.9 Hypercomplex error propagation
-   1.10 Unsupervised learning for hypercomplex layer pre-training and deep belief networks
-   1.11 Hypercomplex layer pre-training using expert features
-   1.12 Hypercomplex transfer learning
-   1.13 Hypercomplex layer with state
-   1.14 Hypercomplex tensor layer
-   1.15 Hypercomplex dropout
-   1.16 Hypercomplex pooling

2 Exemplary implementations of hypercomplex layer

-   2.1 Convolution implementations: GEMM
    -   2.1.1 Using 16 large real GEMM calls
    -   2.1.2 Using 8 large real GEMM calls
-   2.2 Convolution Implementations: cuDNN
-   2.3 Theano implementation
-   2.4 GPU and CPU implementations
-   2.5 Phase-based activation and angle quantization implementation

3 Exemplary hypercomplex deep neural network structures

-   3.1 Feedforward neural network
-   3.2 Neural network with pooling
-   3.3 Recurrent neural network
-   3.4 Neural network with layer jumps
-   3.5 State layers
-   3.6 Parallel Filter Sizes
-   3.7 Parallel Graphs
-   3.8 Combinations of Hypercomplex and Other Modules
-   3.9 Hypercomplex layers in a graph

4 Exemplary applications

-   4.1 Image super resolution
-   4.2 Image segmentation
-   4.3 Image quality evaluation
-   4.4 Image steganalysis
-   4.5 Face recognition
-   4.6 Natural language processing: event embedding
-   4.7 Natural language processing: machine translation
-   4.8 Unsupervised learning: object recognition
-   4.9 Control systems
-   4.10 Generative Models
-   4.11 Medical imaging: breast mass classification
-   4.12 Medical Imaging: MRI
-   4.13 Hypercomplex processing of multi-sensor data
-   4.14 Hypercomplex multispectral image processing and prediction
-   4.15 Hypercomplex image filtering
-   4.16 Hypercomplex processing of gray level images
-   4.17 Hypercomplex processing of enhanced color images
-   4.18 Multimodal biometric identity matching
-   4.19 Multimodal biometric identity matching with autoencoder
-   4.20 Multimodal biometric identity matching with unlabeled data
-   4.21 Multimodal biometric identity matching with transfer learning
-   4.22 Clothing identification

1 Mathematical Definition

A hypercomplex neural network layer is defined presently. In the interest of clarity, a quaternion example is defined below. However, it is to be understood that the method described is applicable to any hypercomplex algebra, including but not limited to biquaternions, exterior algebras, group algebras, matrices, octonions, and quaternions.

1.1 Hypercomplex Layer Introduction

An exemplary hypercomplex layer is shown in FIG. 1. The layer takes a hypercomplex input, performs convolution using a hypercomplex kernel and temporarily stores the result, applies an activation function, and returns a final quaternion output.

Mathematically, in the quaternion case, these steps are defined as follows:

Convolution Step:

Let a∈ℍ^(m×n) denote the input to the quaternion layer. The first step, convolution, produces the output s∈ℍ^(r×t), as defined in Equation 1:

$\begin{matrix}{{{s\left( {x,y} \right)} = {\sum\limits_{u = {- \infty}}^{\infty}{\sum\limits_{v = {- \infty}}^{\infty}{{k\left( {{x - u},{y - v}} \right)} \times {a\left( {u,v} \right)}}}}},} & 1\end{matrix}$

where k∈ℍ^(p×q) represents the convolution filter kernel and the × symbol denotes quaternion multiplication. An alternative notation for hypercomplex convolution is the asterisk, where s=k*_(h)a. Details about hypercomplex convolution are explained in Sections 1.2 and 1.4.

Activation Step:

Continuing with the quaternion example, the activation function is applied to s∈ℍ^(r×t) and may be any mathematical function. An exemplary function used herein is a nonlinear function that converts the quaternion values into polar (angular) representation, quantizes the phase angles, and then recomputes updated quaternion values on an orthonormal basis (1, i, j, k). More information about this function is provided in Sections 1.6 and 1.7. FIG. 2 demonstrates another exemplary activation function, the Rectified Linear Unit (ReLU), a real-valued function applied to each component of a quaternion number.
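The component-wise ReLU activation mentioned above can be sketched in a few lines. This is a minimal illustration only: the function name `quaternion_relu` and the storage layout (the coefficients of 1, i, j, k held in the last array axis) are assumptions made for the example and are not prescribed by the text.

```python
import numpy as np

def quaternion_relu(s):
    """Apply ReLU independently to each real coefficient of a quaternion array.

    `s` has shape (..., 4); the last axis holds the coefficients of 1, i, j, k
    (an assumed layout chosen for this sketch).
    """
    return np.maximum(s, 0.0)

# One quaternion 1 - 2i + 0.5j - 0.1k: negative coefficients are clipped to zero.
s = np.array([1.0, -2.0, 0.5, -0.1])
print(quaternion_relu(s))
```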

1.2 Quaternion Convolution

Exemplary methods for hypercomplex convolution are described presently. The examples described herein all pertain to quaternion convolution.

A quaternion example of hypercomplex convolution is pictured in FIG. 3, where a hypercomplex kernel k∈ℍ^(p×q) is convolved with a hypercomplex input a∈ℍ^(m×n). The process for convolution is to extract the real-valued coefficients from k and a, perform some set of operations on the coefficients, and finally to recombine the coefficients into an output s∈ℍ^(r×t).

A specific example of the approach described above is shown in FIG. 4. It should be noted that, by the distributive property, the product of two quaternion numbers q₀∈ℍ and q₁∈ℍ is:

$\begin{matrix}{{q_{0}q_{1}} = {{\left( {a_{0} + {b_{0}i} + {c_{0}j} + {d_{0}k}} \right)\left( {a_{1} + {b_{1}i} + {c_{1}j} + {d_{1}k}} \right)} = {{a_{0}a_{1}} - {b_{0}b_{1}} - {c_{0}c_{1}} - {d_{0}d_{1}} + {\left( {{a_{0}b_{1}} + {b_{0}a_{1}} + {c_{0}d_{1}} - {d_{0}c_{1}}} \right)i} + {\left( {{a_{0}c_{1}} - {b_{0}d_{1}} + {c_{0}a_{1}} + {d_{0}b_{1}}} \right)j} + {\left( {{a_{0}d_{1}} + {b_{0}c_{1}} - {c_{0}b_{1}} + {d_{0}a_{1}}} \right)k}}}} & 2\end{matrix}$
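Equation 2 can be transcribed directly into code. The sketch below is illustrative: the function name `hamilton_product` and the representation of a quaternion as a length-4 array (a, b, c, d) are assumptions made for the example.

```python
import numpy as np

def hamilton_product(q0, q1):
    """Quaternion product of Equation 2; each argument is (a, b, c, d)."""
    a0, b0, c0, d0 = q0
    a1, b1, c1, d1 = q1
    return np.array([
        a0 * a1 - b0 * b1 - c0 * c1 - d0 * d1,  # real part
        a0 * b1 + b0 * a1 + c0 * d1 - d0 * c1,  # i coefficient
        a0 * c1 - b0 * d1 + c0 * a1 + d0 * b1,  # j coefficient
        a0 * d1 + b0 * c1 - c0 * b1 + d0 * a1,  # k coefficient
    ])

i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton_product(i, j))  # i times j gives k
print(hamilton_product(j, i))  # j times i gives -k, so the product is not commutative
```

Because the product is not commutative, the order of the kernel and the input matters wherever this multiplication is used.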

Because convolution is a linear operator, the multiplication in Equation 2 may be replaced by the convolution operator of Equation 1. Correspondingly, a convolution algorithm to compute s(x,y)=k(x,y)*_(h)a(x,y) for quaternion matrices is shown in Equation 3:

$\begin{matrix}{\begin{aligned}{k(x,y)} *_{h} {a(x,y)} &= \left( a_{k}(x,y) + b_{k}(x,y)i + c_{k}(x,y)j + d_{k}(x,y)k \right) *_{h} \left( a_{a}(x,y) + b_{a}(x,y)i + c_{a}(x,y)j + d_{a}(x,y)k \right) \\ &= a_{k}(x,y) *_{r} a_{a}(x,y) - b_{k}(x,y) *_{r} b_{a}(x,y) - c_{k}(x,y) *_{r} c_{a}(x,y) - d_{k}(x,y) *_{r} d_{a}(x,y) \\ &\quad + \left( a_{k}(x,y) *_{r} b_{a}(x,y) + b_{k}(x,y) *_{r} a_{a}(x,y) + c_{k}(x,y) *_{r} d_{a}(x,y) - d_{k}(x,y) *_{r} c_{a}(x,y) \right)i \\ &\quad + \left( a_{k}(x,y) *_{r} c_{a}(x,y) - b_{k}(x,y) *_{r} d_{a}(x,y) + c_{k}(x,y) *_{r} a_{a}(x,y) + d_{k}(x,y) *_{r} b_{a}(x,y) \right)j \\ &\quad + \left( a_{k}(x,y) *_{r} d_{a}(x,y) + b_{k}(x,y) *_{r} c_{a}(x,y) - c_{k}(x,y) *_{r} b_{a}(x,y) + d_{k}(x,y) *_{r} a_{a}(x,y) \right)k,\end{aligned}} & 3\end{matrix}$

where *_(h) denotes quaternion convolution, *_(r) denotes real-valued convolution, and (x,y) are the 2d array indices.
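The sixteen-convolution form of Equation 3 can be sketched as follows. Several choices here are assumptions made for illustration and are not specified by the text: the helper `conv2d_full` is a plain NumPy stand-in for an optimized real-valued convolution routine, the output uses "full" extent so that r=m+p-1 and t=n+q-1, and quaternion arrays are stored component-first with shape (4, rows, columns).

```python
import numpy as np

def conv2d_full(k, a):
    """Real-valued full 2-D convolution of kernel k (p x q) with input a (m x n)."""
    p, q = k.shape
    m, n = a.shape
    out = np.zeros((m + p - 1, n + q - 1))
    for u in range(p):
        for v in range(q):
            # Each kernel tap adds a shifted, scaled copy of the input.
            out[u:u + m, v:v + n] += k[u, v] * a
    return out

def quaternion_conv2d_16(k, a):
    """Quaternion convolution using sixteen real convolutions (Equation 3).

    k has shape (4, p, q) and a has shape (4, m, n); axis 0 holds the
    coefficients of 1, i, j, k (assumed layout).
    """
    ak, bk, ck, dk = k
    aa, ba, ca, da = a
    c = conv2d_full
    s_a = c(ak, aa) - c(bk, ba) - c(ck, ca) - c(dk, da)
    s_b = c(ak, ba) + c(bk, aa) + c(ck, da) - c(dk, ca)
    s_c = c(ak, ca) - c(bk, da) + c(ck, aa) + c(dk, ba)
    s_d = c(ak, da) + c(bk, ca) - c(ck, ba) + c(dk, aa)
    return np.stack([s_a, s_b, s_c, s_d])

# A 1x1 kernel equal to i convolved with a 1x1 input equal to j reduces to i times j.
kernel = np.zeros((4, 1, 1)); kernel[1, 0, 0] = 1.0
image = np.zeros((4, 1, 1)); image[2, 0, 0] = 1.0
print(quaternion_conv2d_16(kernel, image)[:, 0, 0])
```

In practice the sixteen calls to `conv2d_full` would be replaced by calls into an optimized real-valued library; the sign pattern is the part that carries over.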

One may observe that Equation 3 requires sixteen real-valued convolution operations to perform a single quaternion convolution. However, due to the linearity of convolution, high-speed techniques for quaternion multiplication may also be applied to the convolution operation. For example, FIG. 5 outlines a quaternion convolution algorithm that only requires eight real-valued convolution operations to perform the same mathematical operation as is shown in Equation 3. The eight-real-multiply convolution takes inputs k(x,y)=a_(k)(x,y)+b_(k)(x,y)i+c_(k)(x,y)j+d_(k)(x,y)k and a(x,y)=a_(a)(x,y)+b_(a)(x,y)i+c_(a)(x,y)j+d_(a)(x,y)k, as in Equation 3. However, rather than computing the convolution directly, the following intermediate variables are computed as a first step:

$\begin{matrix}{\begin{aligned}t_{1}(x,y) &= a_{k}(x,y) *_{r} a_{a}(x,y) \\ t_{2}(x,y) &= d_{k}(x,y) *_{r} c_{a}(x,y) \\ t_{3}(x,y) &= b_{k}(x,y) *_{r} d_{a}(x,y) \\ t_{4}(x,y) &= c_{k}(x,y) *_{r} b_{a}(x,y) \\ t_{5}(x,y) &= \left( a_{k}(x,y) + b_{k}(x,y) + c_{k}(x,y) + d_{k}(x,y) \right) *_{r} \left( a_{a}(x,y) + b_{a}(x,y) + c_{a}(x,y) + d_{a}(x,y) \right) \\ t_{6}(x,y) &= \left( a_{k}(x,y) + b_{k}(x,y) - c_{k}(x,y) - d_{k}(x,y) \right) *_{r} \left( a_{a}(x,y) + b_{a}(x,y) - c_{a}(x,y) - d_{a}(x,y) \right) \\ t_{7}(x,y) &= \left( a_{k}(x,y) - b_{k}(x,y) + c_{k}(x,y) - d_{k}(x,y) \right) *_{r} \left( a_{a}(x,y) - b_{a}(x,y) + c_{a}(x,y) - d_{a}(x,y) \right) \\ t_{8}(x,y) &= \left( a_{k}(x,y) - b_{k}(x,y) - c_{k}(x,y) + d_{k}(x,y) \right) *_{r} \left( a_{a}(x,y) - b_{a}(x,y) - c_{a}(x,y) + d_{a}(x,y) \right)\end{aligned}} & 4\end{matrix}$

In Equation 4, *_(r) represents a real-valued convolution; one will observe that there are eight real-valued convolutions.

To complete the quaternion convolution s(x,y), the temporary terms t_(i) are scaled and summed as shown in Equation 5:

$\begin{matrix}{\begin{aligned}a_s(x,y) &= 2t_1 - \frac{1}{4}\left(t_5 + t_6 + t_7 + t_8\right) \\ b_s(x,y) &= -2t_2 + \frac{1}{4}\left(t_5 + t_6 - t_7 - t_8\right) \\ c_s(x,y) &= -2t_3 + \frac{1}{4}\left(t_5 - t_6 + t_7 - t_8\right) \\ d_s(x,y) &= -2t_4 + \frac{1}{4}\left(t_5 - t_6 - t_7 + t_8\right) \\ s(x,y) &= a_s(x,y) + b_s(x,y)\,i + c_s(x,y)\,j + d_s(x,y)\,k\end{aligned}} & 5\end{matrix}$
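The eight-convolution scheme can be checked numerically against the sixteen-convolution form of Equation 3. The following is a minimal sketch, not the implementation of FIG. 5: it uses one-dimensional full convolution on plain Python lists for brevity, and the helper names (`conv1d`, `combine`, `qconv16`, `qconv8`) are hypothetical. In this sketch the doubled term in the four output components is t1, t2, t3, and t4 respectively, which is what makes the result agree with Equation 3.

```python
def conv1d(u, v):
    """Full 1-D real convolution of two sequences."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


def combine(coeffs, seqs):
    """Elementwise weighted sum of equal-length sequences."""
    return [sum(c * s[n] for c, s in zip(coeffs, seqs))
            for n in range(len(seqs[0]))]


def qconv16(k, a):
    """Quaternion convolution using sixteen real convolutions (Equation 3)."""
    ak, bk, ck, dk = k
    aa, ba, ca, da = a
    c = conv1d
    return (
        combine((1, -1, -1, -1), (c(ak, aa), c(bk, ba), c(ck, ca), c(dk, da))),
        combine((1, 1, 1, -1), (c(ak, ba), c(bk, aa), c(ck, da), c(dk, ca))),
        combine((1, -1, 1, 1), (c(ak, ca), c(bk, da), c(ck, aa), c(dk, ba))),
        combine((1, 1, -1, 1), (c(ak, da), c(bk, ca), c(ck, ba), c(dk, aa))),
    )


def qconv8(k, a):
    """Quaternion convolution using eight real convolutions."""
    ak, bk, ck, dk = k
    aa, ba, ca, da = a
    t1 = conv1d(ak, aa)
    t2 = conv1d(dk, ca)
    t3 = conv1d(bk, da)
    t4 = conv1d(ck, ba)
    # Sign patterns applied to the (1, i, j, k) components for t5..t8.
    patterns = [(1, 1, 1, 1), (1, 1, -1, -1), (1, -1, 1, -1), (1, -1, -1, 1)]
    t5678 = tuple(conv1d(combine(p, k), combine(p, a)) for p in patterns)
    q = 0.25
    return (
        combine((2, -q, -q, -q, -q), (t1,) + t5678),
        combine((-2, q, q, -q, -q), (t2,) + t5678),
        combine((-2, q, -q, q, -q), (t3,) + t5678),
        combine((-2, q, -q, -q, q), (t4,) + t5678),
    )


# Each quaternion-valued signal is a tuple of four real component lists.
kernel = ([1, 2], [0, 1], [3, 0], [1, 1])
signal = ([1, 0, 2], [2, 1, 0], [0, 1, 1], [1, 0, 1])
slow = qconv16(kernel, signal)
fast = qconv8(kernel, signal)
```

Because every t term keeps the kernel on the left and the input on the right, the same identities hold when the real operation is a convolution, a scalar product, or a matrix product.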

1.3 Octonion Convolution

Octonions represent another example of hypercomplex numbers. Octonion convolution may be performed using quaternion convolution as outlined presently:

Let o_(n)∈𝕆:

$\begin{matrix}{o_n = e_0 o_{0n} + e_1 o_{1n} + e_2 o_{2n} + e_3 o_{3n} + e_4 o_{4n} + e_5 o_{5n} + e_6 o_{6n} + e_7 o_{7n}} & 6\end{matrix}$

To convolve octonions o_(a)*_(o)o_(b), first represent each argument as a pair of quaternions, resulting in w, x, y, z∈ℍ:

$\begin{matrix}{\begin{aligned}w &= 1o_{0a} + io_{1a} + jo_{2a} + ko_{3a} \\ x &= 1o_{4a} + io_{5a} + jo_{6a} + ko_{7a} \\ y &= 1o_{0b} + io_{1b} + jo_{2b} + ko_{3b} \\ z &= 1o_{4b} + io_{5b} + jo_{6b} + ko_{7b}\end{aligned}} & 7\end{matrix}$

Next, perform quaternion convolution, for example, as described in Section 1.2:

$\begin{matrix}{\begin{aligned}s_L &= w *_h y - z *_h x^{*} \\ s_R &= w *_h z - y *_h x\end{aligned}} & 8\end{matrix}$

where a superscript * denotes quaternion conjugation and *_(h) denotes quaternion convolution.

Finally, recombine s_(L) and s_(R) to form the final result s=o_(a)*_(o)o_(b)∈𝕆:

$\begin{matrix}{\begin{aligned}s_L &= 1a_L + ib_L + jc_L + kd_L \\ s_R &= 1a_R + ib_R + jc_R + kd_R \\ s &= e_0 a_L + e_1 b_L + e_2 c_L + e_3 d_L + e_4 a_R + e_5 b_R + e_6 c_R + e_7 d_R\end{aligned}} & 9\end{matrix}$

The exemplary process of octonion convolution described above is shown in FIG. 6.

1.4 Quaternion Convolution for Neural Networks

Sections 1.2 and 1.3 describe examples of hypercomplex convolution of two-dimensional arrays. The techniques described above may be employed in the multi-dimensional convolution that is typically used for neural network tasks.

For example, FIG. 7 shows a typical multi-dimensional convolution: An input pattern that is a 10×10 array of depth 1 has six two-dimensional convolutional filters applied to it. The convolutional filter kernels are each 3×3 arrays. When performing a convolution where all output points are computed using valid data input values (i.e., no zero padding, sampling stride of 1 in each dimension), the convolution of a single 3×3 array with a single 10×10 array results in a single array of shape 8×8. Since the convolution depicted in FIG. 7 has 6 convolution kernels that are stacked into a 3-dimensional convolution kernel, the final output is of shape 8×8×6, where each one of the six depth locations represents the output of a complete two-dimensional convolution as described in the prior section. This is discussed further below in Section 2.1.
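The shape arithmetic in this example can be reproduced with a short sketch. The function name `valid_output_shape` and its `stride` parameter are hypothetical and are used only for illustration; the sketch assumes no zero padding.

```python
def valid_output_shape(input_shape, kernel_shape, num_kernels, stride=1):
    """Output shape of a 'valid' (no zero padding) 2-D convolution.

    input_shape and kernel_shape are (rows, columns) pairs; the result is
    (output rows, output columns, output depth).
    """
    rows = (input_shape[0] - kernel_shape[0]) // stride + 1
    cols = (input_shape[1] - kernel_shape[1]) // stride + 1
    return (rows, cols, num_kernels)


# The FIG. 7 example: a 10x10 input, six 3x3 kernels, stride 1.
shape = valid_output_shape((10, 10), (3, 3), 6)
```

For the FIG. 7 configuration, `shape` evaluates to `(8, 8, 6)`, matching the 8×8×6 output described above.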

The approach described above may also be extended to input arrays with a depth larger than 1, thereby causing the filter kernel to become 4-dimensional. Conceptually, a loop over the input and output dimensions may be performed. The inside of the loop contains two-dimensional convolutions as described in the prior section. Note further that each two-dimensional convolution takes input values from all depth levels of the input array. If, for example, the 10×10 input array had an input depth of two, then each convolution kernel would have 3×3×2 weights, rather than 3×3 weights as described above. Therefore, the typical shape of a 4-dimensional hypercomplex convolution kernel is (D_(o), D_(i), K_(x), K_(y)), where D_(o) represents the output depth, D_(i) represents the input depth, and K_(x) and K_(y) are the 2d filter kernel dimensions. Finally, in the present hypercomplex example of quaternions, all of the data points above are quaternion values and all convolutions are quaternion in nature. As will be discussed in the implementation sections below, existing computer software libraries typically do not support quaternion arithmetic. Therefore, an additional dimension may be added in software to represent the four components (1, i, j, k) of a quaternion.

1.5 Locally-Connected Layers

The convolutions described thus far are sums of local, weighted windows of the input, where the filter kernel represents a set of shared weights for all window locations. Rather than using shared weights in a filter kernel, a separate set of weights may be used for each window location. In this case, one has a locally-connected hypercomplex layer, rather than a convolutional hypercomplex layer.

1.6 Hypercomplex Polar Form Conversion

The exemplary quaternion conversion to and from polar (angular) form is defined presently. Let a single quaternion value output from the convolution step be denoted as s∈ℍ:

$\begin{matrix}{s = a + bi + cj + dk} & 10\end{matrix}$

The polar conversion representation is shown in Equation 11:

$\begin{matrix}{s = |s|\,e^{i\phi}\,e^{k\psi}\,e^{j\theta},} & 11\end{matrix}$

where s∈ℍ represents a single quaternion number and (ϕ, θ, ψ) represent the quaternion phase angles. Note that the term representing the norm of the quaternion, |s|, is intentionally set to one during the polar conversion.

The angles (ϕ, θ, ψ) are calculated as shown in Equation 12:

$\begin{matrix}{\begin{aligned}\psi &= -\frac{\sin^{-1}\left(2bc - 2ad\right)}{2} \\ &\text{If } \psi \in \left\{+\frac{\pi}{4}, -\frac{\pi}{4}\right\}: \\ \phi &= \frac{1}{2}\tan 2^{-1}\left(2\left(cd + ab\right),\ a^2 - b^2 - c^2 + d^2\right) \\ \theta &= 0 \\ &\text{Else:} \\ \phi &= \frac{1}{2}\tan 2^{-1}\left(2\left(cd + ab\right),\ a^2 - b^2 + c^2 - d^2\right) \\ \theta &= \frac{1}{2}\tan 2^{-1}\left(2\left(bd + ac\right),\ a^2 + b^2 - c^2 - d^2\right)\end{aligned}} & 12\end{matrix}$

The definition of tan 2⁻¹(x,y) is given in Equation 13:

$\begin{matrix}{\tan 2^{-1}\left(x,y\right) = \begin{cases}\tan^{-1}\left(\frac{x}{y}\right) & \text{if } y > 0, \\ \tan^{-1}\left(\frac{x}{y}\right) + \pi & \text{if } x \geq 0,\ y < 0, \\ \tan^{-1}\left(\frac{x}{y}\right) - \pi & \text{if } x < 0,\ y < 0, \\ \frac{\pi}{2} & \text{if } x > 0,\ y = 0, \\ -\frac{\pi}{2} & \text{if } x < 0,\ y = 0, \\ \text{undefined} & \text{if } x = 0,\ y = 0.\end{cases}} & 13\end{matrix}$

Most software implementations of Equation 13 return zero for the case of x=0, y=0, rather than returning an error or a Not a Number (NaN) value.

Finally, to convert the quaternion polar form in Equation 11 to the standard form of Equation 10, one applies Euler's formula as shown in Equation 14:

$\begin{matrix}{\begin{aligned}s_u ={} & 1\left(\cos(\phi)\cos(\psi)\cos(\theta) + \sin(\phi)\sin(\psi)\sin(\theta)\right) \\ & + i\left(\sin(\phi)\cos(\psi)\cos(\theta) - \cos(\phi)\sin(\psi)\sin(\theta)\right) \\ & + j\left(\cos(\phi)\cos(\psi)\sin(\theta) - \sin(\phi)\sin(\psi)\cos(\theta)\right) \\ & + k\left(\cos(\phi)\sin(\psi)\cos(\theta) + \sin(\phi)\cos(\psi)\sin(\theta)\right)\end{aligned}} & 14\end{matrix}$

In Equation 14, s_(u)∈ℍ has the “u” subscript because it is the unit-norm version of our original variable s∈ℍ. Not restricting |s| to 1 in Equation 11 would result in s_(u)=s.
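The conversion from phase angles back to a unit quaternion can be sketched by multiplying three exponentials with the Hamilton product. This is an illustrative sketch only: the function names are hypothetical, and it assumes the factor order e^(iϕ)·e^(kψ)·e^(jθ), which is the order whose expansion reproduces the real, i, and k components written in Equation 14.

```python
import math


def qmul(p, q):
    """Hamilton product of two quaternions given as (1, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def from_polar(phi, theta, psi):
    """Unit quaternion for phase angles (phi, theta, psi).

    Assumed factor order: exp(i*phi) * exp(k*psi) * exp(j*theta).
    """
    e_i = (math.cos(phi), math.sin(phi), 0.0, 0.0)
    e_k = (math.cos(psi), 0.0, 0.0, math.sin(psi))
    e_j = (math.cos(theta), 0.0, math.sin(theta), 0.0)
    return qmul(qmul(e_i, e_k), e_j)


s_u = from_polar(0.7, -0.3, 0.2)
norm = math.sqrt(sum(v * v for v in s_u))
```

Since each factor has unit norm and the quaternion norm is multiplicative, `norm` equals one up to rounding, which is the unit-norm property the "u" subscript refers to.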

1.7 Activation Function Angle Quantization

FIG. 8 shows a quaternion example of a hypercomplex layer activation function. The exemplary hypercomplex activation function converts a quaternion input s∈ℍ to polar form using Equation 12 to produce a set of angles (ϕ, θ, ψ). These angles are then quantized using a set of pre-determined output values. Each angle is associated with a unique set of output angles, and those sets may or may not be equal. Examples of differing output angle sets are shown in FIG. 9, FIG. 10, and FIG. 11, where input vectors (dotted line, red) s are mapped to output locations y. In these diagrams, the output values are placed on the unit circle; however, this need not be the case and is shown merely as an example of the quantization process. FIG. 12 shows an additional hypercomplex example activation function that employs octonions rather than quaternions.

The quantization process described above creates a set of three new angles, (ϕ_(p), θ_(p), ψ_(p)), as shown in FIG. 8. The final step is to apply Equation 14 to these angles, resulting in the final activation function output y∈ℍ.
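One simple way to realize the quantization step is to map each angle to the closest member of its pre-determined output set. The sketch below assumes that nearest-value rule and measures closeness around the circle; both choices, and the function names, are assumptions made for illustration rather than details given in the figures.

```python
import math


def quantize_angle(angle, levels):
    """Return the entry of `levels` closest to `angle`, measured on the circle."""
    def circular_distance(a, b):
        diff = (a - b) % (2 * math.pi)
        return min(diff, 2 * math.pi - diff)

    return min(levels, key=lambda level: circular_distance(angle, level))


def quantize_phase(angles, level_sets):
    """Quantize (phi, theta, psi), each against its own set of output angles."""
    return tuple(quantize_angle(a, levels) for a, levels in zip(angles, level_sets))


# Hypothetical output sets: four levels for phi, two each for theta and psi.
phi_levels = [0.0, math.pi / 2, math.pi, -math.pi / 2]
theta_levels = [-math.pi / 4, math.pi / 4]
psi_levels = [-math.pi / 8, math.pi / 8]

quantized = quantize_phase((1.0, 0.1, -0.3), (phi_levels, theta_levels, psi_levels))
```

The quantized triple would then be passed through Equation 14 to obtain the unit-norm activation output.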

1.8 Hypercomplex Layer Learning Rule

A learning rule for a single hypercomplex neuron is described presently. A layer of hypercomplex neurons is merely a collection of hypercomplex neurons that run in parallel; therefore, the same learning rule applies to all neurons independently. The output of each neuron forms a single output of a hypercomplex layer. This learning rule is an example of the quaternion case and is not meant to limit the scope of the claims in this document.

Stochastic gradient descent has gained popularity in the machine learning community for training real-valued neural network layers. However, because the activation function described in Section 1.7 and shown in FIG. 8 does not have a continuous output, it is non-differentiable and therefore cannot be used with gradient descent methods. Accordingly, an error-correction methodology is introduced here for neural network layer training. As with stochastic gradient descent, we assume that the neural weights are initialized randomly.

A typical error correction weight update rule for a fully-connected neural network layer is shown in Equation 15:

$\begin{matrix}{w_i^{(k+1)} = w_i^{(k)} + \mu \cdot \delta_i \cdot \overline{x}_i} & 15\end{matrix}$

Equation 15 represents training cycle k of a neuron, where x_(i) is the input value to the weight, δ_(i) is related to the current training error, μ is the learning rate, w_(i)^((k)) is the current value of the weight, w_(i)^((k+1)) is the updated value of the weight, and i is the index of all the weights for the neuron.

To extend the fully-connected error correction rule in Equation 15 to convolutional layers, the multiplication between δ_(i) and x̄_(i) will be replaced by a quaternion convolution. However, before performing this replacement, the values of δ_(i) ∀ i must be derived.

Returning to the fully-connected example, the goal of training a neuron is to have its output, y^((k+1)), equal some desired response value d∈ℍ. Solving for δ_(i):

$\begin{matrix}\begin{matrix}{d = y^{({k + 1})}} \\{= {X^{T} \cdot W^{({k + 1})}}} \\{= {\sum\limits_{i}{x_{i} \cdot w_{i}^{({k + 1})}}}} \\{= {\sum\limits_{i}{x_{i} \cdot \left( {w_{i}^{(k)} + {\delta_{i} \cdot {\overset{\_}{x}}_{i}}} \right)}}} \\{= {{\sum\limits_{i}{x_{i} \cdot w_{i}^{(k)}}} + {x_{i} \cdot \delta_{i} \cdot {\overset{\_}{x}}_{i}}}} \\{= {{\sum\limits_{i}{x_{i} \cdot w_{i}^{(k)}}} + \delta_{i}}} \\{= {{\sum\limits_{i}{x_{i} \cdot w_{i}^{(k)}}} + {\sum\limits_{i}\delta_{i}}}} \\{= {y^{(k)} + {\sum\limits_{i}\delta_{i}}}}\end{matrix} & 16\end{matrix}$

Note that Equation 16 employs the relationship:

$\begin{matrix}{\sqrt{x \cdot \overline{x}} = \sqrt{\overline{x} \cdot x} = \|x\| \text{ for } x \in \mathbb{R}, \mathbb{C}, \text{ or } \mathbb{H}} & 17\end{matrix}$

where ∥x∥ is the 2-norm of x and x̄ is the quaternion conjugate of x.

Further note that the norm of each input is assumed to equal one, as the norm of the output from the activation function in Sections 1.6 and 1.7 is equal to one. This is not a restriction of the algorithm, as weight updates in the learning algorithm (discussed in Sections 1.8 and 1.9) may be scaled to account for non-unit-norm inputs, and/or inputs may be scaled to unit norm. Equation 16 clearly has more unknowns than equations and therefore does not have a unique solution. Accordingly, the authors assume without proof that each neural weight shares equal responsibility for the final error, and that all of the δ_(i) variables should equal one another. This leads to the following weight update rule, assuming that there are n weights:

$\begin{matrix}{w_{i}^{({k + 1})} = {w_{i}^{(k)} + {\frac{\mu}{n} \cdot \left( {d - y^{(k)}} \right) \cdot {\overset{\_}{x}}_{i}}}} & 18\end{matrix}$

In Equation 18, k represents the current training cycle, x_(i) is the input value to the weight, n is the number of weights, μ is the learning rate, w_(i)^((k)) is the current value of the weight, w_(i)^((k+1)) is the updated value of the weight, and i is the index of all the weights for the neuron.
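The update of Equation 18 can be exercised on a single quaternion neuron. The following is a minimal sketch under stated assumptions: the inputs have unit norm, the activation function is omitted, and, as a choice of this sketch rather than something the text specifies, the neuron output is formed as the sum of w_(i)·x_(i) with the weight on the left, so that the correction term δ·x̄_(i)·x_(i) reduces to δ without relying on commutativity. All function names are hypothetical.

```python
def qmul(p, q):
    """Hamilton product of two quaternions given as (1, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def qconj(q):
    """Quaternion conjugate."""
    return (q[0], -q[1], -q[2], -q[3])


def qadd(p, q):
    return tuple(u + v for u, v in zip(p, q))


def neuron_output(weights, inputs):
    """Sum of w_i * x_i (weight on the left; an assumption of this sketch)."""
    total = (0.0, 0.0, 0.0, 0.0)
    for w, x in zip(weights, inputs):
        total = qadd(total, qmul(w, x))
    return total


def update_weights(weights, inputs, desired, mu=1.0):
    """One error-correction step: w_i += (mu / n) * (d - y) * conj(x_i)."""
    n = len(weights)
    y = neuron_output(weights, inputs)
    step = tuple((mu / n) * (d - v) for d, v in zip(desired, y))
    return [qadd(w, qmul(step, qconj(x))) for w, x in zip(weights, inputs)]


inputs = [(1.0, 0.0, 0.0, 0.0), (0.5, 0.5, 0.5, 0.5), (0.0, 0.6, 0.0, 0.8)]
weights = [(0.2, -0.1, 0.4, 0.0), (0.0, 0.3, -0.2, 0.1), (-0.5, 0.0, 0.1, 0.2)]
desired = (0.0, 1.0, 0.0, 0.0)

new_weights = update_weights(weights, inputs, desired, mu=1.0)
new_output = neuron_output(new_weights, inputs)
```

With μ = 1 and unit-norm inputs, a single update moves the pre-activation output onto the desired response; smaller values of μ move it only part of the way.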

Extending this approach to convolution and multiple neurons, the new weight update rule is:

$\begin{matrix}{{W^{({k + 1})} = {W^{(k)} + {{\frac{\mu}{n} \cdot \left( {D - Y^{(k)}} \right)}*{\overset{\_}{X}}_{i}}}},} & 19\end{matrix}$ where W^((k+1)) is the updated weight array, W^((k)) is the current weight array, n is the number of neurons, μ is the learning rate, D is the desired response vector, Y^((k)) is the current response vector, * represents quaternion convolution, and X̄_(i) is the quaternion conjugate of the input.

1.9 Hypercomplex Error Propagation

This section provides an example of error propagation between hypercomplex layers to enable learning in multi-layer graph structures. This particular example continues the quaternion example of Section 1.8. Following the approach in Section 1.8, the multi-layer learning rule will be derived using a fully-connected network, and then, by linearity of convolution, the appropriate hypercomplex multiplication will be converted into a quaternion convolution operation.

FIG. 13 shows an example of a two-layer, fully-connected neural network, with each layer containing two neurons. Layer A is on the left hand side and Layer B is on the right hand side of the figure. This particular example includes neural inputs that are of constant value 1 and are referred to as “bias weights.”

As described in Section 1.8, the neurons each use an error correction rule for learning and, consequently, cannot be used with the gradient descent learning methods that are popular in existing machine learning literature.

Following the approach in Section 1.8, we set the output of the neural network equal to the desired response d and solve for the error terms δ_(i):

$\begin{matrix}\begin{matrix}{d = y^{({k + 1})}} \\{= {X_{B}^{T} \cdot W_{B}^{({k + 1})}}} \\{= {w_{B_{0}}^{(k)} + {\sum\limits_{i}{x_{B_{i}}^{({k + 1})} \cdot w_{B_{i}}^{({k + 1})}}}}} \\{= {w_{B_{0}}^{(k)} + \delta_{B_{0}} + {\sum\limits_{i = 1}^{n - 1}{x_{B_{i}}^{({k + 1})} \cdot \left( {w_{B_{i}}^{(k)} + {\delta_{B_{i}}\overset{\_}{x_{B_{\iota}}}}} \right)}}}} \\{= {w_{B_{0}}^{(k)} + \delta_{B_{0}} + {\sum\limits_{i = 1}^{n - 1}{x_{B_{i}}^{({k + 1})} \cdot w_{B_{i}}^{(k)}}} + \delta_{B_{i}}}} \\{= {w_{B_{0}}^{(k)} + \delta_{B_{0}} + {\sum\limits_{i = 1}^{n - 1}\delta_{B_{i}}} + {\sum\limits_{i = 1}^{n - 1}{x_{B_{i}}^{({k + 1})} \cdot w_{B_{i}}^{(k)}}}}} \\{= {w_{B_{0}}^{(k)} + {\sum\limits_{i = 0}^{n - 1}\delta_{B_{i}}} + {\sum\limits_{i = 1}^{n - 1}{x_{B_{i}}^{({k + 1})} \cdot w_{B_{i}}^{(k)}}}}} \\{= {w_{B_{0}}^{(k)} + {\sum\limits_{i = 0}^{n - 1}\delta_{B_{i}}} + {\sum\limits_{i = 1}^{n - 1}{\left( {\sum\limits_{j = 0}^{m - 1}{x_{A_{j}} \cdot w_{A_{j}}^{({k + 1})}}} \right) \cdot w_{B_{i}}^{(k)}}}}} \\{= {w_{B_{0}}^{(k)} + {\sum\limits_{i = 0}^{n - 1}\delta_{B_{i}}} + {\sum\limits_{i = 1}^{n - 1}{\left( {\sum\limits_{j = 0}^{m - 1}{x_{A_{j}} \cdot \left( {w_{A_{j}}^{(k)} + {\delta_{A_{j}} \cdot \overset{\_}{x_{A_{j}}}}} \right)}} \right) \cdot w_{B_{i}}^{(k)}}}}} \\{= {w_{B_{0}}^{(k)} + {\sum\limits_{i = 0}^{n - 1}\delta_{B_{i}}} + {\sum\limits_{i = 1}^{n - 1}{\left( {{\sum\limits_{j = 0}^{m - 1}{x_{A_{j}} \cdot w_{A_{j}}^{(k)}}} + \delta_{A_{j}}} \right) \cdot w_{B_{i}}^{(k)}}}}} \\{= {w_{B_{0}}^{(k)} + {\sum\limits_{i = 0}^{n - 1}\delta_{B_{i}}} + {\sum\limits_{i = 1}^{n - 1}{w_{B_{i}}^{(k)}{\sum\limits_{j = 0}^{m - 1}{x_{A_{j}} \cdot w_{A_{j}}^{(k)}}}}} + {\sum\limits_{i = 1}^{n - 1}{w_{B_{i}}^{(k)}{\sum\limits_{j = 0}^{m - 1}\delta_{A_{j}}}}}}} \\{= {y_{B}^{(k)} + {\overset{n - 1}{\sum\limits_{i = 0}}\delta_{B_{i}}} + {\sum\limits_{i = 0}^{n - 1}{\sum\limits_{j = 0}^{m - 1}{w_{B_{i}}^{(k)}\delta_{A_{j}}}}}}}\end{matrix} & 20\end{matrix}$

Solving for the training error in terms of δ_(A_(j)) and δ_(B_(i)):

$\begin{matrix}{{d - y^{(k)}} = {{\sum\limits_{i = 0}^{n - 1}\delta_{B_{i}}} + {\sum\limits_{i = 0}^{n - 1}{\sum\limits_{j = 0}^{m - 1}{w_{B_{i}}^{(k)}\delta_{A_{j}}}}}}} & 21\end{matrix}$

As in the single-neuron case, there is more than one solution to Equation 21. In order to resolve this problem, the assumption is that each neural network layer contributes equally to the final network output error, implying:

$\begin{matrix}{{\sum\limits_{i = 0}^{n - 1}\delta_{B_{i}}} = {\sum\limits_{i = 0}^{n - 1}{\sum\limits_{j = 0}^{m - 1}{w_{B_{i}}^{(k)}\delta_{A_{j}}}}}} & 22\end{matrix}$

Accordingly, errors are propagated through the graph (or network) by scaling the errors by the hypercomplex multiplicative inverse of the connecting weights.

To propagate the error from the output of a layer to its inputs:

$\begin{matrix}{{e_{l - 1} = {\frac{1}{N_{l - 1}} \cdot \left\lbrack w_{l}^{{(k)}^{- 1}} \right\rbrack^{T} \cdot e_{l}}},} & 23\end{matrix}$ where e_(l) represents the error at the current layer (or the network error at the output layer), [w_(l)^((k))⁻¹]^(T) is the transposed hypercomplex elementwise inverse of the current layer's weights at training cycle (k), N_(l-1) is the number of inputs to the l−1st layer, and e_(l-1) is the new error term for the l−1st layer. In Equation 23, the layers are arranged from l∈[0,L], with the Lth layer corresponding to the output layer and the 0th layer corresponding to the input layer. The error terms e_(l) are computed from output to input, therefore propagating the error backward through the network.

Once the error terms e_(l) have been computed, the weight update is similar to the single-layer case of Section 1.8:

$\begin{matrix}{{W_{l}^{({k + 1})} = {W_{l}^{(k)} + {{\frac{\mu}{n} \cdot e_{l}^{(k)}}*_{h}{\overset{\_}{X}}_{l - 1}}}},} & 24\end{matrix}$ where n represents the number of neurons in layer l, and *_(h) represents hypercomplex convolution.

For inputs that are not of unit-norm, Equation 24 may be modified to scale the weight updates:

$\begin{matrix}{W_{l}^{({k + 1})} = {W_{l}^{(k)} + {{\frac{\mu}{n} \cdot e_{l}^{(k)}}*_{h}\left\lbrack X_{l - 1}^{- 1} \right\rbrack^{T}}}} & 25\end{matrix}$

1.10 Unsupervised Learning for Hypercomplex Layer Pre-Training and Deep Belief Networks

The hypercomplex learning algorithms of Sections 1.8 and 1.9 both presume that the initial weights for each layer are selected randomly. However, this need not be the case. For example, FIG. 14 shows an exemplary pre-training method.

Suppose there is a multi-layer hypercomplex neural network in which the input layer is “Layer A”, the first hidden layer is “Layer B”, the next layer is “Layer C”, and so on. Unsupervised learning of Layer A may be performed by removing it from the network, attaching a fully-connected layer, and training the new structure as an autoencoder, which means that the desired response is equal to the input. A diagram of an exemplary hypercomplex autoencoder has been drawn in FIG. 14. Once the autoencoder is finished training, the output of Layer A will be a sparse representation of the input data, and therefore is likely to be a useful representation for the final task of the original multi-layer hypercomplex network.

One may use the pre-trained output from Layer A to pre-train Layer B in the same manner. This is shown in the lower half of FIG. 14.

Once pre-training of all layers is complete, the hypercomplex weight settings of each layer may be copied to the original multi-layer network, and fine-tuning of the weight parameters may be performed using any appropriate learning algorithm, such as those developed in Sections 1.8 and 1.9.

This approach is superior to starting with random weights and propagating errors for two reasons: First, it allows one to use large quantities of unlabeled data for the initial pre-training, thereby expanding the universe of useful training data; and, second, since weight adjustment through multi-layer error propagation takes many training cycles, the pre-training procedure significantly reduces overall training time by reducing the workload for the error propagation algorithm.

Moreover, the autoencoders described above may be replaced by any other unsupervised learning method, for example, restricted Boltzmann machines (RBMs). Applying contrastive divergence to each layer, from the lowest to the highest, results in a hypercomplex deep belief network.

1.11 Hypercomplex Layer Pre-Training Using Expert Features

Historically, multi-layer perceptron (MLP) neural network classification systems involve the following steps: Input data, such as images, are converted to a set of features, for example, local binary patterns. Each feature, which, for example, takes the form of a real number, is stacked into a feature vector that is associated with a particular input pattern. Once the feature vector for each input pattern is computed, the feature vectors and system desired response (e.g., classification) values are presented to a fully-connected MLP neural network for training. Many have observed that overall system performance is highly dependent upon the selection of features, and therefore domain experts have spent extensive time engineering features for these systems.

One may think of hypercomplex convolutional networks as a way to algorithmically learn hypercomplex features. However, it may be desirable to incorporate expert features into the system as well. An exemplary method for this is shown in FIG. 15, where a neural network layer or layers is pre-trained using the output from one or more expert features as the desired response. Note that one would typically use this method with a subset of layers, for example, the hypercomplex convolutional layers, of a larger hypercomplex graph.

As with the unsupervised pre-training method described in Section 1.10, the hypercomplex layer or layers is pre-trained and then its weights are copied back to the original hypercomplex graph. Learning rules, for example those of Sections 1.8 and 1.9, may then be applied to the entire graph to fine-tune the weights.

Using this method, one can start with feature mappings defined by domain experts and then improve the mappings further with hypercomplex learning techniques. Advantages include use of domain-expert knowledge and reduced training time.

1.12 Hypercomplex Transfer Learning

Deep learning algorithms perform well in systems where an abundance of training data is available for use in the learning process. However, many applications do not have an abundance of data; for example, many medical classification tasks do not have large databases of labeled images. One solution to this problem is transfer learning, where other data sets and prediction tasks are used to expand the trained knowledge of the hypercomplex neural network or graph.

An illustration of transfer learning is shown in FIG. 16. The top half of the figure shows traditional learning systems, where each system is trained to perform a single task and creates a set of knowledge. In the context of hypercomplex networks, the knowledge would correspond with the weight settings for each layer. The bottom half of FIG. 16 shows transfer learning, where each of the “other systems” is trained separately on various tasks, and then the knowledge is integrated into the trained knowledge of the target system in which we are interested.

There are numerous ways of performing the knowledge transfer, though methods will generally seek to represent the “other tasks” and target task in the same feature space through an adaptive (and possibly nonlinear) transform, e.g., using a hypercomplex neural network.

1.13 Hypercomplex Layer with State

An exemplary neural network layer is discussed in Section 1.1 and shown in FIG. 1; this layer is a feedforward convolutional layer. Another important example is a neural network layer with state, which is shown in FIG. 17. This layer combines a hypercomplex input with an internal state, performs hypercomplex operations (for example, convolution), stores and outputs a result, optionally performs additional mathematical operations, and finally updates its internal state.

FIG. 18 shows another example of internal state, where the neuron output is a function of the state variable(s) directly rather than being computed at an intermediate step during processing.

1.14 Hypercomplex Tensor Layer

The hypercomplex tensor layer enables evaluation of whether or not hypercomplex vectors are in a particular relationship. An exemplary quaternion layer is defined in Equation 26.

$\begin{matrix}{y = {{g\left( {a,R,b} \right)} = {u_{R}^{T} \cdot {f\left( {{a^{T} \cdot W_{R}^{\lbrack{1:k}\rbrack} \cdot b} + {V_{R} \cdot \begin{bmatrix}a \\b\end{bmatrix}} + b_{R}} \right)}}}} & 26\end{matrix}$

In Equation 26, a∈ℍ^(d×1) and b∈ℍ^(d×1) represent quaternion input vectors to be compared. There are k relationships R that may be established between a and b, and each relationship has its own hypercomplex weight array W_(R)^([i])∈ℍ^(d×d), where 1≤i≤k (and W_(R)^([1:k])∈ℍ^(d×d×k)). The output of a^(T)·W_(R)^([1:k])·b is computed by slicing W_(R)^([1:k]) for each value of i in [1, k] and performing two-dimensional hypercomplex matrix multiplication. Furthermore, the hypercomplex weight array V_(R)∈ℍ^(k×2d) acts as a fully-connected layer. Finally, a set of hypercomplex bias weights b_(R)∈ℍ^(k×1) may optionally be present.

The function f is the layer activation function and may, for example, be the hypercomplex angle quantization function of Section 1.7. Finally, the weight vector u_(R)∈ℍ^(k×1) is transposed and multiplied to create the final output y∈ℍ. When an output vector in ℍ^(k×1) is desired rather than a single quaternion number, the multiplication with u_(R)^(T) may be omitted.

A major advantage to hypercomplex numbers in this layer structure is that hypercomplex multiplication is not commutative, which helps the learning structure understand that g(a, R, b)≠g(b, R, a). Since many relationships are nonsymmetrical, this is a useful property. For example, the relationship, “The branch is part of the tree,” makes sense, whereas the reverse relationship, “The tree is part of the branch,” does not make sense. Moreover, due to the hypercomplex nature of a and b, one can compare relationships between tuples rather than only single elements.
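The effect of non-commutativity on the bilinear term a^(T)·W·b can be seen in a small sketch. It evaluates only that one term for a single weight slice, with hypothetical names and an identity weight matrix chosen purely for illustration; it is not the full layer of Equation 26.

```python
def qmul(p, q):
    """Hamilton product of two quaternions given as (1, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


def bilinear(a, W, b):
    """Compute a^T . W . b for quaternion vectors a, b and quaternion matrix W."""
    total = (0, 0, 0, 0)
    for r, a_r in enumerate(a):
        for c, b_c in enumerate(b):
            term = qmul(qmul(a_r, W[r][c]), b_c)
            total = tuple(t + u for t, u in zip(total, term))
    return total


ZERO, ONE = (0, 0, 0, 0), (1, 0, 0, 0)
I, J = (0, 1, 0, 0), (0, 0, 1, 0)

a = [I, ONE]
b = [J, ONE]
W = [[ONE, ZERO], [ZERO, ONE]]  # identity weights, for illustration only

forward = bilinear(a, W, b)   # i*j + 1 = 1 + k
reverse = bilinear(b, W, a)   # j*i + 1 = 1 - k
```

Even with identical weights, swapping the arguments flips the sign of the k component, so the two orderings remain distinguishable to later layers.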

1.15 Hypercomplex Dropout

In order to prevent data overfitting, dropout operators are frequently found in neural networks. A real-valued dropout operator sets each element in a real-valued array to zero with some nonzero (but usually small) probability. Hypercomplex dropout operators may also be employed in hypercomplex networks, again to prevent data overfitting. However, in a hypercomplex dropout operator, if one of the hypercomplex components (e.g., (1, i, j, k) in the case of a quaternion) is set to zero, then all other components must also be zero to preserve inter-channel relationships within the hypercomplex value. If unit-norm output is desired or required, the real component may be assigned a value of one and all other components may be assigned a value of zero.
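A quaternion dropout operator along these lines can be sketched as follows. The function name, its parameters, and the use of Python's `random.Random` generator are assumptions made for illustration.

```python
import random


def quaternion_dropout(values, drop_prob, rng, unit_norm=False):
    """Drop whole quaternions, never individual components.

    values: list of (1, i, j, k) tuples.
    drop_prob: probability that a given quaternion is dropped.
    rng: a random.Random instance, passed in so results are reproducible.
    unit_norm: if True, a dropped value becomes (1, 0, 0, 0) instead of zero.
    """
    replacement = (1.0, 0.0, 0.0, 0.0) if unit_norm else (0.0, 0.0, 0.0, 0.0)
    return [replacement if rng.random() < drop_prob else q for q in values]


values = [(float(n + 1), float(n), 1.0, -1.0) for n in range(50)]
dropped = quaternion_dropout(values, 0.3, random.Random(0))
dropped_unit = quaternion_dropout(values, 0.3, random.Random(0), unit_norm=True)
```

Because the decision is made once per quaternion, a surviving value keeps all four of its components and a dropped value loses all of them together.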

1.16 Hypercomplex Pooling

It is frequently desirable to downsample data within a hypercomplex network using a hypercomplex pooling operation. An exemplary method for performing this operation is to take a series of local windows, apply a function to each window (e.g., maximum function), and finally represent the data using only the function's output for each window. In hypercomplex networks, the function will, for example, take arguments that involve all hypercomplex components and produce an output that preserves inter-channel hypercomplex relationships.
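One possible pooling function of this kind selects, in each window, the quaternion with the largest norm and passes it through unchanged. Using the norm as the selection criterion is an assumption of this sketch (other functions of all four components would also qualify), and the function name is hypothetical.

```python
def max_norm_pool(grid, window):
    """Pool a 2-D grid of quaternions over non-overlapping square windows.

    grid: list of rows, each a list of (1, i, j, k) tuples.
    window: side length of each pooling window.
    Returns, for each window, the quaternion of largest norm, kept whole so
    that the relationship among its four components is preserved.
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        pooled_row = []
        for c in range(0, cols - window + 1, window):
            block = [grid[r + dr][c + dc]
                     for dr in range(window) for dc in range(window)]
            pooled_row.append(max(block, key=lambda q: sum(v * v for v in q)))
        pooled.append(pooled_row)
    return pooled


grid = [
    [(1, 0, 0, 0), (0, 2, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1)],
    [(0, 0, 3, 0), (1, 1, 0, 0), (0, 0, 0, 0), (0, 1, 0, 0)],
]
pooled = max_norm_pool(grid, 2)
```

Taking a per-component maximum instead would mix components from different quaternions, which is the loss of inter-channel relationships that the paragraph above cautions against.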

2 Exemplary Implementations of Hypercomplex Layer

A major difficulty in the practical application of learning systems is the computational complexity of the learning process. In particular, the convolution step described in Equation 1 typically cannot be computed efficiently via Fast Fourier Transform due to small filter kernel sizes. Accordingly, significant effort has been expended by academia and industry to optimize real-valued convolution for real-valued neural network graphs. However, no group has optimized hypercomplex convolution for these tasks, nor is there literature on hypercomplex deep learning. This section presents methods for adapting real-valued computational techniques to hypercomplex problems. The techniques in this section are critical to making the hypercomplex systems described in Section 1 practical for engineering use.

2.1 Convolution Implementations: GEMM

One approach to computing the hypercomplex convolution of Equation 1 is to write a set of for loops that shift the hypercomplex kernel to every appropriate position in the input image, perform a small set of multiplications and additions, and then continue to the next position. While such an approach would theoretically work, modern computer processors are optimized for large matrix-matrix multiplication operations, rather than multiplication between small sets of numbers. Correspondingly, the approach presented in this subsection reframes the hypercomplex convolution problem as a large hypercomplex multiplication, and then explains how to use highly-optimized, real-valued multiplication libraries to complete the computation. Examples of such real-valued multiplication libraries include: Intel's Math Kernel Library (MKL); Automatically Tuned Linear Algebra Software (ATLAS); and Nvidia's cuBLAS. All of these are implementations of a Basic Linear Algebra Subprograms (BLAS) library, and all provide matrix-matrix multiplication functionality through the GEMM function.

In order to demonstrate a hypercomplex neural network convolution, an example of the quaternion case is discussed presently. To aid the discussion, define the following variables:

$\begin{matrix}{\begin{aligned}X &= \text{inputs of shape } (G, D_i, 4, X_i, Y_i) \\ A &= \text{filter kernel of shape } (D_o, D_i, 4, K_x, K_y) \\ S &= \text{outputs of shape } (G, D_o, 4, X_o, Y_o)\end{aligned}} & 27\end{matrix}$

where G is the number of data patterns in a processing batch, D_(i) is the input depth, D_(o) is the output depth, and the last two dimensions of each size represent the rows and columns for each variable. Note that the variables X, A, and S have been defined to correspond to real-valued memory arrays common in modern computers. Since each memory location holds a single real number and not a hypercomplex number, each variable above has been given an extra dimension of size 4. This dimension is used to store the quaternion components of the array. The goal of this computation is to compute the quaternion convolution S=A*X. As discussed above, one strategy is to reshape the A and X matrices such that a single quaternion matrix multiply could be employed to compute the convolution. Reshaped matrices A′ and X′ are shown in Equation 28, shown in FIG. 60.

In A′ of Equation 28, a_(i) are row vectors. The variable i indexes theoutput dimension of the kernel, D₀. Each row vector is of lengthD_(i)·K_(x)·K_(y), corresponding to a filter kernel at all input depths.Observe that the A′ matrix is of depth 4 to store the quaternioncomponents; therefore, A′ may be thought of as a two-dimensionalquaternion matrix.

In X′ of Equation 28, x_(r,s) are column vectors. The variable r indexesthe data input patterns dimension, G. Each column vector is of lengthD_(i)·K_(x)·K_(y), corresponding to a filter kernel at all input depths.Since the filter kernel must be applied to each location in the image,for each input pattern, there are many columns x_(r,s). The s subscriptis to index the filter locations; each input pattern contains M filterlocations, where M equals:M=(X _(i) −K _(x)+1)·(Y _(i) −K _(y)+1)  29

The total number of columns of X′ is equal to G·M. Like A′, X′ is a two-dimensional quaternion matrix and is stored in three real dimensions in computer memory.

The arithmetic for quaternion convolution is performed in Equation 28, using a quaternion matrix-matrix multiply function that is described below in Sections 2.1.1 and 2.1.2.

One will observe that the result S′ from Equation 28 is still a two-dimensional quaternion matrix. This result is reshaped to form the final output S.
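The reshaping described above may be sketched in NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, the quaternion-component axis is moved to the front of A′ and X′ as a convenience (the layout of Equation 28 in FIG. 60 may differ), and the explicit loops favor readability over speed.

```python
import numpy as np

def num_filter_locations(x_i, y_i, k_x, k_y):
    """Equation 29: number of valid filter locations M per input pattern."""
    return (x_i - k_x + 1) * (y_i - k_y + 1)

def reshape_kernel(A):
    """(D_o, D_i, 4, K_x, K_y) -> A' of shape (4, D_o, D_i*K_x*K_y)."""
    d_o, d_i, _, k_x, k_y = A.shape
    # Component axis first (assumed layout), then flatten each filter into a row.
    return A.transpose(2, 0, 1, 3, 4).reshape(4, d_o, d_i * k_x * k_y)

def reshape_inputs(X, k_x, k_y):
    """(G, D_i, 4, X_i, Y_i) -> X' of shape (4, D_i*K_x*K_y, G*M)."""
    g, d_i, _, x_i, y_i = X.shape
    x_o, y_o = x_i - k_x + 1, y_i - k_y + 1
    cols = np.empty((4, d_i * k_x * k_y, g * x_o * y_o), dtype=X.dtype)
    col = 0
    for r in range(g):              # r indexes the input pattern
        for u in range(x_o):        # (u, v) together play the role of s
            for v in range(y_o):
                patch = X[r, :, :, u:u + k_x, v:v + k_y]   # (D_i, 4, K_x, K_y)
                # Same (depth, row, column) flattening order as reshape_kernel.
                cols[:, :, col] = patch.transpose(1, 0, 2, 3).reshape(4, -1)
                col += 1
    return cols
```

With G=2, D_(i)=3, a 6×6 input, and a 3×3 kernel, each pattern has M=16 filter locations, so X′ has 27 rows and 32 columns per quaternion component.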

2.1.1 Using 16 Large Real GEMM Calls

An example of a quaternion matrix multiply routine that may be used to compute Equation 28 is discussed presently. One approach is to employ Equation 2, which computes the product of two quaternions using sixteen real-valued multiply operations. The matrix form of this equation is identical to the scalar version in Equation 2, and each multiply may be performed using a highly-optimized GEMM call to the appropriate BLAS library. Moreover, the sixteen real-valued multiply operations may be performed in parallel if hardware resources permit.
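A minimal sketch of the sixteen-multiply approach is given below. It assumes that Equation 2 expresses the standard Hamilton product, that the quaternion components are stored along the first array axis, and that NumPy's `@` operator stands in for a real-valued GEMM call; the function name is hypothetical.

```python
import numpy as np

def quat_matmul_16(K, A):
    """Quaternion matrix product K*A using sixteen real matrix multiplies.

    K has shape (4, m, n) and A has shape (4, n, p); axis 0 holds the
    (1, i, j, k) components. Each `@` below corresponds to one real GEMM.
    """
    ak, bk, ck, dk = K
    aa, ba, ca, da = A
    return np.stack([
        ak @ aa - bk @ ba - ck @ ca - dk @ da,   # real part
        ak @ ba + bk @ aa + ck @ da - dk @ ca,   # i part
        ak @ ca - bk @ da + ck @ aa + dk @ ba,   # j part
        ak @ da + bk @ ca - ck @ ba + dk @ aa,   # k part
    ])
```

The sixteen products are mutually independent, which is why they may be dispatched in parallel when hardware resources permit.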

2.1.2 Using 8 Large Real GEMM Calls

Another example of a quaternion matrix multiply routine requires only eight GEMM calls, rather than the sixteen calls of Equation 2. This method takes arguments k(x,y)=a_(k)(x,y)+b_(k)(x,y)i+c_(k)(x,y)j+d_(k)(x,y)k and a(x,y)=a_(a)(x,y)+b_(a)(x,y)i+c_(a)(x,y)j+d_(a)(x,y)k, and performs the quaternion operation k·a. The first step is to compute eight intermediate values using the real-valued GEMM call:

$\begin{matrix}
{t_{1}(x,y) = a_{k}(x,y) \cdot a_{a}(x,y)} \\
{t_{2}(x,y) = d_{k}(x,y) \cdot c_{a}(x,y)} \\
{t_{3}(x,y) = b_{k}(x,y) \cdot d_{a}(x,y)} \\
{t_{4}(x,y) = c_{k}(x,y) \cdot b_{a}(x,y)} \\
{t_{5}(x,y) = \left( a_{k}(x,y) + b_{k}(x,y) + c_{k}(x,y) + d_{k}(x,y) \right) \cdot \left( a_{a}(x,y) + b_{a}(x,y) + c_{a}(x,y) + d_{a}(x,y) \right)} \\
{t_{6}(x,y) = \left( a_{k}(x,y) + b_{k}(x,y) - c_{k}(x,y) - d_{k}(x,y) \right) \cdot \left( a_{a}(x,y) + b_{a}(x,y) - c_{a}(x,y) - d_{a}(x,y) \right)} \\
{t_{7}(x,y) = \left( a_{k}(x,y) - b_{k}(x,y) + c_{k}(x,y) - d_{k}(x,y) \right) \cdot \left( a_{a}(x,y) - b_{a}(x,y) + c_{a}(x,y) - d_{a}(x,y) \right)} \\
{t_{8}(x,y) = \left( a_{k}(x,y) - b_{k}(x,y) - c_{k}(x,y) + d_{k}(x,y) \right) \cdot \left( a_{a}(x,y) - b_{a}(x,y) - c_{a}(x,y) + d_{a}(x,y) \right)}
\end{matrix}\qquad(30)$

To complete the quaternion multiplication s(x,y), the temporary terms t₁ through t₈ are scaled and summed as shown in Equation 31:

$\begin{matrix}
{a_{s}(x,y) = 2t_{1} - \frac{1}{4}\left( t_{5} + t_{6} + t_{7} + t_{8} \right)} \\
{b_{s}(x,y) = -2t_{2} + \frac{1}{4}\left( t_{5} + t_{6} - t_{7} - t_{8} \right)} \\
{c_{s}(x,y) = -2t_{3} + \frac{1}{4}\left( t_{5} - t_{6} + t_{7} - t_{8} \right)} \\
{d_{s}(x,y) = -2t_{4} + \frac{1}{4}\left( t_{5} - t_{6} - t_{7} + t_{8} \right)} \\
{s(x,y) = a_{s}(x,y) + b_{s}(x,y)i + c_{s}(x,y)j + d_{s}(x,y)k}
\end{matrix}\qquad(31)$

Because the GEMM calls represent the majority of the compute time, the method in this section executes more quickly than the method of Section 2.1.1.
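The eight-multiply routine may be sketched and checked numerically as follows. The sketch assumes component-first storage and uses NumPy's `@` in place of a GEMM call; the function names are hypothetical. In the final combination, the products d_(k)·c_(a), b_(k)·d_(a), and c_(k)·b_(a) are each subtracted twice from the i, j, and k components respectively, which is what makes the result agree with the direct Hamilton product used as a reference here.

```python
import numpy as np

def quat_matmul_8(K, A):
    """Quaternion matrix product K*A using eight real matrix multiplies."""
    ak, bk, ck, dk = K
    aa, ba, ca, da = A
    t1 = ak @ aa
    t2 = dk @ ca
    t3 = bk @ da
    t4 = ck @ ba
    t5 = (ak + bk + ck + dk) @ (aa + ba + ca + da)
    t6 = (ak + bk - ck - dk) @ (aa + ba - ca - da)
    t7 = (ak - bk + ck - dk) @ (aa - ba + ca - da)
    t8 = (ak - bk - ck + dk) @ (aa - ba - ca + da)
    return np.stack([
        2 * t1 - 0.25 * (t5 + t6 + t7 + t8),    # real part
        -2 * t2 + 0.25 * (t5 + t6 - t7 - t8),   # i part
        -2 * t3 + 0.25 * (t5 - t6 + t7 - t8),   # j part
        -2 * t4 + 0.25 * (t5 - t6 - t7 + t8),   # k part
    ])

def quat_matmul_reference(K, A):
    """Direct Hamilton product (sixteen multiplies), used only for checking."""
    ak, bk, ck, dk = K
    aa, ba, ca, da = A
    return np.stack([
        ak @ aa - bk @ ba - ck @ ca - dk @ da,
        ak @ ba + bk @ aa + ck @ da - dk @ ca,
        ak @ ca - bk @ da + ck @ aa + dk @ ba,
        ak @ da + bk @ ca - ck @ ba + dk @ aa,
    ])

rng = np.random.default_rng(0)
K = rng.standard_normal((4, 3, 5))
A = rng.standard_normal((4, 5, 2))
agree = np.allclose(quat_matmul_8(K, A), quat_matmul_reference(K, A))
```

The saving comes at the cost of additional real matrix additions, which are inexpensive relative to the multiplies.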

2.2 Convolution Implementations: cuDNN

One pitfall of the GEMM-based approach described in the prior section is that formation of the temporary quaternion matrices A′ and X′ is memory-intensive. Graphics cards, for example those manufactured by Nvidia, are frequently used for matrix multiplication. Unfortunately, these cards have limited onboard memory, and therefore inefficient use of memory is a practical engineering problem.

For real-valued convolution, memory-efficient software such as Nvidia's cuDNN library has been developed. This package performs real-valued convolution in a memory- and compute-efficient manner. Therefore, rather than using the GEMM-based approach above, adapting cuDNN or another convolution library to hypercomplex convolution may be advantageous. Because, like multiplication, convolution is a linear operation, the algorithms for quaternion multiplication may be directly applied to quaternion convolution by replacing real-valued multiplication with real-valued convolution. This leads to Equations 3, 4, and 5 of Section 1.2 and is explained in more detail there. The real-valued convolutions in these equations may be carried out using an optimized convolution library such as cuDNN, thereby making quaternion convolution practical on current computer hardware.
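The substitution of real-valued convolution for real-valued multiplication may be illustrated with the sketch below. It is not the cuDNN-based implementation: a small NumPy routine stands in for the optimized library, the operation is the cross-correlation form commonly called convolution in neural network libraries, a single input and output depth is assumed, and the kernel is taken as the left factor of the Hamilton product. All function names are hypothetical.

```python
import numpy as np

def corr2d_valid(x, w):
    """Real-valued 'valid' 2-D cross-correlation of image x with kernel w."""
    k_x, k_y = w.shape
    out = np.zeros((x.shape[0] - k_x + 1, x.shape[1] - k_y + 1))
    for u in range(k_x):
        for v in range(k_y):
            out += w[u, v] * x[u:u + out.shape[0], v:v + out.shape[1]]
    return out

def quat_conv2d(W, X):
    """Quaternion convolution built from sixteen real convolutions.

    W: (4, K_x, K_y) quaternion kernel; X: (4, X_i, Y_i) quaternion image.
    Returns an array of shape (4, X_i-K_x+1, Y_i-K_y+1).
    """
    aw, bw, cw, dw = W
    ax, bx, cx, dx = X
    c = corr2d_valid
    return np.stack([
        c(ax, aw) - c(bx, bw) - c(cx, cw) - c(dx, dw),   # real part
        c(bx, aw) + c(ax, bw) + c(dx, cw) - c(cx, dw),   # i part
        c(cx, aw) - c(dx, bw) + c(ax, cw) + c(bx, dw),   # j part
        c(dx, aw) + c(cx, bw) - c(bx, cw) + c(ax, dw),   # k part
    ])
```

In a practical system each call to `corr2d_valid` would be replaced by a call into an optimized convolution library, and the eight-product scheme of Section 2.1.2 could be applied in the same way.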

2.3 Theano Implementation

The hypercomplex neural network layers described thus far are ideal for use in arbitrary graph structures. A graph structure is a collection of layers (e.g. mathematical functions, hypercomplex or otherwise) with a set of directed edges that define the signal flow from each layer to the other layers (and potentially itself). Extensive effort has been expended to create open-source graph solving libraries, for example, Theano.

Three key mathematical operations are required to use the hypercomplex layers with a graph library such as Theano: first, the forward computation through the layer must be performed (Sections 1.2 to 1.7); second, a weight update must be computed by a learning rule for a single layer (Section 1.8); and, third, errors must be propagated through the layer to the next graph element (Section 1.9). Since these three operations have been introduced in this document, it is possible to use the hypercomplex layer in an arbitrary graph structure.

The authors have implemented an exemplary set of Theano operations to enable the straightforward construction of arbitrary graphs of hypercomplex layers. The authors employ the memory storage layout of using a three-dimensional real-valued array to represent a two-dimensional quaternion array; this method is described further in Section 2.1 in the context of GEMM operations. Simulation results in Section 4 have been produced using the exemplary Theano operations and using the cuDNN library as described in Section 2.2.

2.4 GPU and CPU Implementations

As has been alluded to throughout this document, computational efficiency is a key criterion that must be met in order for hypercomplex layers to be practical in solving engineering challenges. Because convolution and matrix multiplication are computationally intensive, the standard approach is to run these tasks on specialized hardware, such as a graphics processing unit (GPU), rather than on a general-purpose processor (e.g. from Intel). The cuDNN library referenced in Section 2.2 is specifically written for Nvidia GPUs. The exemplary implementation of hypercomplex layers indeed employs the cuDNN library and therefore operates on the GPU. However, the computations may be performed on any other computational device and the implementation discussed here is not meant to limit the scope or claims of this patent.

2.5 Phase-Based Activation and Angle Quantization Implementation

An important bottleneck in GPU computational performance is the time delay to transfer data to and from the GPU memory. Because the computationally-intensive tasks of convolution and multiplication are implemented on the GPU in the exemplary software, it is critical that all other graph operations take place on the GPU to reduce transfer time overhead. Therefore, the activation functions described in Sections 1.6 and 1.7 have also been implemented using the GPU, as have pooling operators and other neural network functions.

3 Exemplary Hypercomplex Deep Neural Network Structures

Sections 1 and 2 of this document have provided examples of hypercomplex neural network layers, using quaternion numbers for illustrative purposes. This section discusses exemplary graph structures (i.e. “neural networks”) of hypercomplex layers. The layers may be arranged in any graph structure and therefore have wide applicability to the engineering problems discussed in Section 4. The Theano implementation of the hypercomplex layer, discussed in Section 2.3, allows for construction of arbitrary graphs of hypercomplex layers (and other components) using minimal effort.

3.1 Feedforward Neural Network

FIG. 19 demonstrates the simplest graph of hypercomplex layers. The output of each layer feeds the input to the next layer, until a final output is reached. The layers may be any type of hypercomplex layer, including fully-connected, convolutional, locally-connected, and so on.

3.2 Neural Network with Pooling

The feedforward neural network of FIG. 19 may be combined with a variety of other neural network layers. For example, a hypercomplex pooling operator may be added as shown in FIG. 20. The hypercomplex pooling operator applies a function to local windows of hypercomplex points, with the goal of reducing the window to a single hypercomplex number.
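One possible pooling function is sketched below for the quaternion case. The choice of function is an assumption made purely for illustration: within each non-overlapping window, the quaternion of largest norm is retained. The function name, the component-first array layout, and the window size are likewise hypothetical, and other reductions (for example, a component-wise mean) would fit the same description.

```python
import numpy as np

def quat_max_norm_pool(Q, window=2):
    """Reduce each window of a (4, H, W) quaternion image to one quaternion.

    The retained quaternion is the one with the largest norm in the window
    (an assumed pooling rule). Rows/columns that do not fill a window are dropped.
    """
    _, h, w = Q.shape
    h_o, w_o = h // window, w // window
    # Split the image into (h_o, w_o) windows of window*window quaternions each.
    blocks = Q[:, :h_o * window, :w_o * window].reshape(4, h_o, window, w_o, window)
    blocks = blocks.transpose(0, 1, 3, 2, 4).reshape(4, h_o, w_o, window * window)
    norms = np.sqrt((blocks ** 2).sum(axis=0))       # (h_o, w_o, window*window)
    winner = norms.argmax(axis=-1)                   # (h_o, w_o)
    idx = np.broadcast_to(winner, (4, h_o, w_o))[..., None]
    # Pick all four components of the winning quaternion together.
    return np.take_along_axis(blocks, idx, axis=-1)[..., 0]
```

Selecting all four components from the same location keeps each pooled value a genuine quaternion from the input, rather than a mixture of components taken from different points.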

3.3 Recurrent Neural Network

Graphs of hypercomplex layers also may contain feedback loops, as shown in FIG. 21.

3.4 Neural Network with Layer Jumps

As shown in FIG. 22, not all of the hypercomplex layers must be connected in sequence; the output of one layer may skip one or more layers before reconnecting to the system.

3.5 State Layers

In the entirety of this document, the term “hypercomplex layer” is meant to encompass any variation of a hypercomplex neural network layer, including, but not limited to, fully-connected layers such as those of FIG. 13, convolutional layers such as FIG. 1, locally-connected layers, and layers incorporating state variables such as FIG. 16 and FIG. 18.

3.6 Parallel Filter Sizes

FIG. 23 depicts an exemplary arrangement of hypercomplex layers and other exemplary operations such as pooling and concatenation. The structure in FIG. 23 allows one to operate multiple filter sizes in parallel. Moreover, by reducing the number of output channels at the top-level 1×1 convolution filters, one can reduce the computational overhead of the entire structure in the dotted box in FIG. 23.

3.7 Parallel Graphs

Hypercomplex layers and/or graphs may, for example, be combined in parallel with other structures. The final output of such a system may be determined by combining the results of the hypercomplex layers and/or graph and the other structure using any method of fusion, e.g. averaging, voting systems, maximum likelihood estimation, maximum a posteriori estimation, and so on. An example of this type of system is depicted in FIG. 24.
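Two of the fusion methods named above, averaging and voting, may be sketched as follows. The function names are hypothetical, the inputs are assumed to be class-probability vectors (for averaging) or integer class labels (for voting) produced by the parallel structures, and ties in the vote are resolved in favor of the lowest class label as a simplifying assumption.

```python
import numpy as np

def fuse_average(prob_a, prob_b):
    """Average the class probabilities produced by two parallel structures."""
    return (np.asarray(prob_a, dtype=float) + np.asarray(prob_b, dtype=float)) / 2.0

def fuse_vote(*label_predictions):
    """Majority vote over integer class labels from several structures."""
    votes = np.stack([np.asarray(p) for p in label_predictions])  # (n_models, n_samples)
    n_classes = int(votes.max()) + 1
    # counts[c, s] = number of structures that predicted class c for sample s.
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)
```

For example, `fuse_vote([0, 1, 1], [0, 1, 2], [1, 1, 2])` selects, for each of the three samples, the label chosen by at least two of the three structures.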

3.8 Combinations of Hypercomplex and Other Modules

Hypercomplex layers, graphs, and/or modules may be combined in series with nonhypercomplex components; an example of this is shown in FIG. 25. In this example, each graph layer may be hypercomplex or nonhypercomplex. Each layer has a selection algorithm to determine what type/dimensionality is appropriate for the layer. The example in FIG. 25 employs the sparsity of the layer's weight matrix as a measure for component selection. For example, layers that are “harder” to train, according to the sparsity measure, could be chosen to be hypercomplex, while “easier” layers could be real-valued for computational efficiency.

3.9 Hypercomplex Layers in a Graph

The hypercomplex layers may be arranged in any graph-like structure; Sections 3.1 to 3.8 provide examples but are not meant to limit the arrangement or interconnection of hypercomplex layers. As discussed in Section 2.3, the exemplary quaternion hypercomplex layer has specifically been implemented to allow for the creation of arbitrary graphs of hypercomplex layers. These layers may be combined with any other mathematical function(s) to create systems.

4 Exemplary Applications

This section provides exemplary engineering applications of the hypercomplex neural network layer.

4.1 Image Super Resolution

An exemplary application of hypercomplex neural networks is image super resolution. In this task, a color image is enlarged such that the image is represented by a larger number of pixels than the original. Image super resolution is performed by digital cameras, where it is called “digital zoom,” and has many applications in surveillance, security, and other industries.

Image super resolution may be framed as an estimation problem: Given a low-resolution image, estimate the higher-resolution image. Real-valued, deep neural networks have been used for this task. To use a neural network for image super resolution, the following steps are performed: To simulate downsampling, full-size original images are blurred using a Gaussian kernel; the blurred images are paired with their original, full-resolution sources and used to train a neural network as input and desired response, respectively; and, finally, new images are presented to the neural network and the network output is taken to be the enhanced image.

One major limitation of the above procedure is that real-valued neural networks do not understand color, and therefore most approaches in the literature are limited to grayscale images. We adapt the hypercomplex neural network layer introduced in this patent application to the super resolution application in FIG. 26. In particular, the quaternion example of the hypercomplex layer is convenient for describing color images. The quaternion's polar form has three angles, which correspond well to the three color channels in an image.

Each step in FIG. 26 is explained henceforth:

1. A database of full-size images is created, where each image is 32×32 pixels;
2. Sets of training and testing images are blurred using a Gaussian kernel. These are the downsampled images that will be used for training and as test input. Since the processed images correspond to original images in the image database, one is able to measure the performance of the neural network predictor with respect to image quality;
3. Create quaternion versions of the 3-dimensional color images by encoding the three image colors as the angles (ϕ, θ, ψ);
4. Convert this polar form to quaternion (1, i, j, k) representation using Equation 14;
5. Using the processed, downsampled training images as input and the original, full-resolution training images as a desired response, train the hypercomplex neural network;
6. Employ the trained neural network weights to predict the full-resolution versions of the processed test images;
7. Convert the network output back to polar form using Equation 12;
8. Assign the angular values (ϕ, θ, ψ) to each image color channel;
9. Compare the predicted test images with the original test images in the database, according to peak signal-to-noise ratio (PSNR), a standard image quality evaluation metric.
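The comparison in the final step may be sketched as follows, using the conventional definition of PSNR as ten times the base-10 logarithm of the squared peak value divided by the mean squared error. The function name and the default peak value of 1.0 (images scaled to the range 0 to 1) are assumptions; the scaling used in the reported experiments is not stated here.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)   # mean squared error over all pixels
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform error of 0.1 on an image with a peak value of 1.0 gives a mean squared error of 0.01 and therefore a PSNR of 20 dB.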

The above steps were performed using a quaternion neural network with 3 layers. The first convolutional layer has the parameters D_(i)=1, D_(o)=64, K_(x)=K_(y)=9. The second layer takes input directly from the first and has parameters D_(i)=64, D_(o)=32, K_(x)=K_(y)=1. Finally, the third layer takes input directly from the second and has parameters D_(i)=32, D_(o)=1, K_(x)=K_(y)=5. For information on parameter definitions, please see Section 2.1 of this document. Note that the convolution operations were performed on all valid points (i.e. no zero padding), so the prediction image is of size 20×20 color pixels, which is somewhat smaller than the 32×32 pixel input size.
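The output size quoted above follows from applying the valid-convolution rule (output size equals input size minus kernel size plus one) once per layer, as the short check below illustrates. The function name is hypothetical.

```python
def valid_output_size(size, kernel_sizes):
    """Spatial size after a chain of 'valid' (unpadded) convolutions."""
    for k in kernel_sizes:
        size = size - k + 1   # each layer trims k - 1 pixels in this dimension
    return size

# 32 -> 24 (9x9 kernel) -> 24 (1x1 kernel) -> 20 (5x5 kernel)
final_size = valid_output_size(32, [9, 1, 5])
```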

This experiment was repeated using a real-valued convolutional neural network with the same parameters. However, the quaternion polar form (Equations 12 and 14) is not used for the real-valued network. Rather, the images are presented to the network as a depth 3 input, where each input depth corresponds to one of the colors. Consequently, D_(i)=3 for the first layer rather than 1 in the quaternion case, and D_(o)=3 in the last layer. This allows the real neural network to process color images, but the real network does not understand that there is a significant relationship between the three color channels. Each neural network was trained using 256 input images. The networks were trained for at least 150,000 cycles; in all cases, the training error had reached steady state before training was deemed complete. An additional set of 2544 images was used for testing.

TABLE 1

| Neural Network | PSNR (Training Set) | PSNR (Testing Set) |
|---|---|---|
| Real-Valued | 29.72 dB | 30.06 dB |
| Hypercomplex | 31.24 dB | 31.89 dB |

The mean PSNR values are shown for the training and testing datasets in Table 1. Higher PSNR values represent better results, and one can observe that the hypercomplex network outperforms the real-valued network.

Sample visual outputs from each algorithm are shown in FIG. 27 through FIG. 32: FIG. 27 shows an original, ground truth image; FIG. 28 shows the real-valued neural network prediction; FIG. 29 shows the hypercomplex network prediction; and FIG. 30 through FIG. 32 repeat this process for a second sample image. One can observe that the real-valued algorithm has difficulty predicting the orange color of the image and substitutes green instead. The hypercomplex algorithm introduced in this patent application does not exhibit this difficulty with color images.

4.2 Image Segmentation

Another exemplary application of the hypercomplex neural network is color image segmentation. Image segmentation is the process of assigning pixels in a digital image to multiple sets, where each set typically represents a meaningful item in the image. For example, in a digital image of airplanes flying, one may want to segment the image into sky, airplane, and cloud areas.

When combined with the pooling operator, real-valued convolutional neural networks have seen wide application to image classification. However, the shift-invariance caused by pooling makes for poor image localization, which is necessary for image segmentation. One strategy to combat this problem is to upsample the image during convolution. Such upsampling is standard practice in signal processing, where one inserts zeros between consecutive data values to create a larger image. The hypercomplex neural network is then trained on the upsampled image.
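Zero-insertion upsampling of the kind described above may be sketched as follows. The function name and the default factor of 2 are illustrative assumptions; any trailing axes of the input (for example, color or quaternion components stored last) are carried through unchanged.

```python
import numpy as np

def upsample_zero_insert(image, factor=2):
    """Insert zeros between consecutive samples along the first two axes."""
    h, w = image.shape[:2]
    out = np.zeros((h * factor, w * factor) + image.shape[2:], dtype=image.dtype)
    out[::factor, ::factor] = image   # original samples land on a regular grid
    return out
```

Applied to a 2×2 image with a factor of 2, the result is a 4×4 image in which the four original values occupy every other row and column and all remaining entries are zero.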

An exemplary system for hypercomplex image segmentation is shown in FIG. 33. Input images are upsampled, processed through a pre-trained hypercomplex neural network, smoothed using bi-linear interpolation, and finally processed using a conditional random field (CRF) to produce a final image segmentation result.

4.3 Image Quality Evaluation

The hypercomplex networks introduced in this patent may also be used for automated image quality evaluation.

Image quality analysis is fundamentally a task for human perception, and therefore humans provide the ultimate ground truth in image quality evaluation tasks. However, human ranking of images is time-consuming and expensive, and therefore computational models of human visual perception are of great interest. Humans typically categorize images using natural language, i.e. with words like “good” or “bad.” Existing studies have asked humans to map these qualitative descriptions to a numerical score, for example, 1 to 100. Research indicates that people are not good at the mapping process, and therefore the step of mapping to a numerical score adds noise. Finally, existing image quality measurement systems have attempted to learn the mapping from image to numerical score, and are impeded by the noisy word-to-score mapping described previously.

Another approach is to perform blind image quality assessment, where the image quality assessment system learns the mapping from image to qualitative description. Thus, a blind image quality assessment system is effectively a multiclass classifier, where each class corresponds to a word such as “excellent” or “bad.” Such a system is shown in FIG. 34 and is another exemplary application of the proposed hypercomplex networks.

The first processing step of the system in FIG. 34 computes color features that are based on human visual perception. Next, the color features are processed simultaneously using a hypercomplex network. The network estimates probabilities for each class, and finally the class probabilities are combined to select the final image quality estimate.

A key feature of the hypercomplex network is its ability to process all color channels simultaneously, thereby preserving important inter-channel relationships. Existing methods either process images in grayscale or process color channels separately, thereby losing important information.

4.4 Image Steganalysis

Another exemplary application of hypercomplex deep learning is image steganalysis. Image steganography is the technique of hiding information in images by slightly altering the pixel values of the image. The alteration of pixels is performed such that the human visual system cannot perceive a difference between the original and altered image. Consequently, steganography allows for covert communications over insecure channels and has been used by a variety of terrorist organizations for hidden communications. Popular steganographic algorithms include HUGO, WOW, and S-UNIWARD.

Methods for detecting steganography in images have been developed. However, all of the methods either process images in grayscale or process each color channel separately, thereby losing important inter-channel relationships. The hypercomplex neural network architecture in this patent application overcomes both limitations; an example of a steganalysis system is shown in FIG. 35. In this system, a hypercomplex convolutional neural network is used to develop steganalysis features, and a fully-connected hypercomplex layer (or layers) is used to perform the final classification task.

4.5 Face Recognition

Due to hypercomplex networks' advantages in color image processing, face recognition is another exemplary application. One method of face recognition is shown in FIG. 36. In this system, the face is first aligned to a position in two dimensions. It is then mapped onto a 3D surface through the use of reference points (e.g. eyes, nose, etc.), and a direct frontal image of the 3D face is then formed. Next, the face is processed using a hypercomplex network. The network may use any combination of, for example, convolutional layers, locally-connected layers, fully-connected layers, hypercomplex pooling, and/or hypercomplex dropout while processing the image. Additional normalization or other processing may be performed after the hypercomplex network processing. Finally, a face recognition measure is output for identification.

4.6 Natural Language Processing: Event Embedding

An important task in natural language processing is event embedding, the process of converting events into vectors. An example of an event may be, “The cat ate the mouse.” In this case, the subject O₁ is the cat, the predicate P is “ate”, and the object O₂ is the mouse. Open-source software such as ZPar can be used to extract the tuple (O₁, P, O₂) from the sentence, and this tuple is referred to as the event.

However, the tuple (O₁, P, O₂) is still a tuple of words, and machine reading and translation tasks typically require the tuple to be represented as a vector. An exemplary application of a hypercomplex network for tuple event embedding is shown in FIG. 37. The first step for event embedding is to extract (O₁, P, O₂). Next, word embeddings for each element of the tuple are computed; there are many standard ways to compute word embeddings, for example, the Microsoft Distributed Machine Learning Toolkit. Since word embeddings are a sparse feature space, typically (but optionally) feature space reduction will be performed, e.g., using clustering. Next, each element of the generated tuple of word embeddings may be represented as a hypercomplex angle, using the same strategy as in Section 4.1. Finally, a single output embedding vector is obtained through use of the hypercomplex network introduced in this patent application. In particular, the hypercomplex tensor layer structure introduced in Section 1.14 is useful for this task.

4.7 Natural Language Processing: Machine Translation

A natural extension of the event embedding example of Section 4.6 is machine translation: Embedding a word, event, or sentence as a vector, and then running the process in reverse but using neural networks trained in a different language. The result is a system that automatically translates from one written language to another, e.g. from English to Spanish.

The most general form of a machine translation system is shown in FIG. 38, where a complete sentence is input to a hypercomplex network, an embedding is created, and then a different hypercomplex network outputs the translated sentence. Note that, due to the sequential nature of sentences, these hypercomplex networks will typically have state elements within them, as in FIG. 18. Moreover, the networks will generally be trained together, with the goal of maximizing the probability of obtaining the correct translation, typically with a log-likelihood cost function.

Another example of a machine translation system is shown in FIG. 39, where the event embedding pipeline of Section 4.6 is performed, followed by a reverse set of steps to compute the final translated sentence. In this system, each step may be trained separately to allow for more fine-grained control of system performance. As with the system in FIG. 37, this system may also have all of the blocks trained together according to a single cost function. The choice of training methodology will be application-specific and may be decided by the user.

4.8 Unsupervised Learning: Object Recognition

The deep neural network graphs described in this patent application are typically trained using labeled data, where each training input pattern has an associated desired response from the network. However, in many applications, extensive sets of labeled data are not available. Therefore, a way of learning from unlabeled data is valuable in practical engineering problems.

An exemplary approach to using unlabeled data is to train autoassociative hypercomplex neural networks, where the desired response of the network is the same as the input. Provided that the hypercomplex network has intermediate representations (i.e. layer outputs within the graph) that are of smaller size than the input, the autoassociative network will create a sparse representation of the input data during the training process. This sparse representation can be thought of as a form of nonlinear, hypercomplex principal component analysis (PCA) and extracts only the most “informative” pieces of the original data. Unlike linear PCA, the hypercomplex network performs this information extraction in a nonlinear manner that takes all hypercomplex components into account.

An example of autoassociative hypercomplex neural networks is discussed in Section 1.10, where autoassociative structures are employed for pre-training neural network layers.

Additionally, unsupervised learning of hypercomplex neural networks may be used for feature learning in computer vision applications. For example, in FIG. 40, a method for training an image recognition system is shown. First, the network is trained on unlabeled images according to the pretraining method of FIG. 14. Next, a smaller set of images that also have labels is used to fine-tune the hypercomplex network. Finally, the hypercomplex network may be used for image classification and recognition tasks.

4.9 Control Systems

Hypercomplex neural networks and graph structures may be employed, for example, to control systems with state, i.e. a plant. FIG. 41 shows a simple system where an image is processed using a hypercomplex convolutional neural network to determine a control action. The action is input to the system, which then generates a new image for further processing. Examples of machines following this structure include autonomous vehicles and aircraft, automated playing of video games, and other tasks where rewards are assigned after a sequence of steps.

4.10 Generative Models

FIG. 42 demonstrates an exemplary application of hypercomplex neural networks where two networks are present: The generator network takes noise as input and generates a fake image. Next, the discriminator network compares the fake image and a real image to decide which image is “real.” This prediction amounts to a probability distribution over the data labels and is then employed for training both networks. Use of data generation may enhance the performance of image classification models such as the exemplary model shown in FIG. 33, FIG. 34, FIG. 35, or FIG. 36.

4.11 Medical Imaging: Breast Mass Classification

Another exemplary application of hypercomplex networks is breast cancer classification or severity grading. For this application, use of the word “classification” will refer both to applications where cancer presence is determined on a true or false basis, and to applications where the cancer is graded according to any type of scale, i.e. grading the severity of the cancer.

A simple system for breast cancer classification is shown in FIG. 43. In this system, the image is first processed to find the general area of potential tumors. Each tumor is then segmented using any image segmentation method (or, for example, the image segmentation method of Section 4.2), and finally the segmented image is processed by a hypercomplex network to determine a final cancer classification.

Note that a variety of multi-dimensional mammogram techniques are currently under development, and that false color is currently added to existing mammograms. Therefore, the advantages that a hypercomplex network has when processing color and multi-dimensional data apply to this example.

Since breast cancer classification has been studied extensively, a large database of expert features has already been developed for use with this application. A further improvement upon the described hypercomplex breast cancer classification system is shown in FIG. 44, where existing expert features are pre-trained into the hypercomplex network before fine-tuning the network with standard learning techniques. This method of pre-training is described in detail in Section 1.11 and in FIG. 15.

Finally, additional postprocessing may be performed after the output of the hypercomplex network. An example of this is shown in FIG. 45, where random forest processing is applied to potentially further enhance classification accuracy.

4.12 Medical Imaging: MRI

FIG. 46 depicts an example of processing magnetic resonance image (MRI) data using hypercomplex representations. Once reconstructed from polar sensor data, MRI data is typically displayed as gray-level images. As shown in FIG. 46, these images may be enhanced using a variety of image processing techniques, including logarithms or any other image enhancement algorithm. Next, the original and enhanced images may be combined using any fusion technique and, finally, represented using hypercomplex values. This hypercomplex representation may serve as input to further image processing and classification systems, for example, the hypercomplex neural networks discussed elsewhere in this document.

4.13 Hypercomplex processing of multi-sensor data

Most modern sensing applications employ multiple sensors, either in the context of a sensor array using multiple sensors of the same type or by using different types of sensors (e.g. as in a smartphone). Because these sensors are typically measuring related, or the same, quantities, the output from each sensor is usually related in some way to the outputs from the other sensors.

Multi-sensor data may be represented by hypercomplex numbers and, accordingly, processed in a unified manner by the introduced hypercomplex neural networks. An example of speech recognition is shown in FIG. 47. In this system, a speech recognizer is trained using a speaker who talks close to the microphone and therefore has excellent signal to noise ratio. The person's speech is converted to cepstrum coefficients, input to the speech recognizer, and then converted to text.

The goal is to perform similar speech recognition on a speaker who is far away from a microphone array. To accomplish this, a microphone array captures far away speech and represents it using hypercomplex numbers. Next, a deep hypercomplex neural network (possibly including state elements) is trained to output corrected cepstrum coefficients by using the close talking input and its cepstrum converter to create labeled training data. Finally, during the recognition phase, the close talking microphone can be disabled completely and the hypercomplex neural network feeds the speech recognizer directly, delivering speech recognition quality similar to that of the close-talking system.

4.14 Hypercomplex Multispectral Image Processing and Prediction

Multispectral imaging is important for a variety of material analysis applications, such as remote sensing, art and ancient writing investigation and decoding, fruit quality analysis, and so on. In multispectral imaging, images of the same object are taken at a variety of different wavelengths. Since materials have different reflectance and absorption properties at different wavelengths, multispectral images allow one to perform analysis that is not possible with the human eye.

A typical multispectral image dataset is processed using a classifier to determine some property or score of the material. This process is shown in FIG. 48, where multiple images are captured, represented using hypercomplex numbers, and then classified using a hypercomplex deep neural network.

In FIG. 49, the exemplary application of multispectral prediction of apple (fruit) firmness and soluble solids content is demonstrated. Images of an apple are captured at numerous wavelengths and represented as hypercomplex numbers. Three images are shown in the figure, but the approach is applicable to any number of images or wavelengths. Next, the images are processed using an apple fruit image processing pipeline, and finally classified using a hypercomplex neural network.

4.15 Hypercomplex Image Filtering

Another exemplary application of the hypercomplex neural networks introduced in this patent is color image filtering, where the colors are processed in a unified fashion as shown in FIG. 50. Traditionally, color images are split into their color channels and each channel is processed separately, as shown in the top half of FIG. 50. The hypercomplex structures in this patent allow systems to process images in a unified manner, enabling the preservation and processing of important inter-channel relationships. This is shown in the bottom pane of FIG. 50.
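As one illustration of unified color processing, the following minimal sketch encodes an RGB pixel as a pure quaternion and applies a single quaternion rotation that acts on all three channels at once. The encoding (red, green, and blue as the three imaginary components), the choice of a rotation as the filter operation, and the function names `hamilton`, `conjugate`, and `rotate_color` are assumptions made for illustration only; they are not taken from FIG. 50.

```python
import math

def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def conjugate(q):
    """Quaternion conjugate: negate the three imaginary components."""
    a, b, c, d = q
    return (a, -b, -c, -d)

def rotate_color(rgb, axis, angle):
    """Rotate an RGB pixel, encoded as a pure quaternion, about a unit axis."""
    half = angle / 2.0
    s = math.sin(half)
    q = (math.cos(half), s * axis[0], s * axis[1], s * axis[2])
    pixel = (0.0, rgb[0], rgb[1], rgb[2])  # assumed encoding: 0 + R*i + G*j + B*k
    _, r, g, b = hamilton(hamilton(q, pixel), conjugate(q))
    return (r, g, b)

# Rotating by 120 degrees about the gray axis (R = G = B) cyclically
# permutes the channels, which no per-channel filter could do.
gray_axis = (1 / math.sqrt(3),) * 3
rotated = rotate_color((0.9, 0.2, 0.1), gray_axis, 2 * math.pi / 3)
```

Because the three channels travel together through one Hamilton product, the operation can mix channels, which is the inter-channel relationship that per-channel processing discards.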

4.16 Hypercomplex Processing of Gray Level Images

While processing of multichannel, color images has been discussed at length in this application, hypercomplex structures may also be employed for single-channel, gray-level images. FIG. 51 shows an example where a single-channel image is enhanced using any sort of image enhancement algorithm (e.g., contrast enhancement, application of logarithms, etc.). The original and enhanced images are subsequently combined and represented using hypercomplex numbers, which can be employed with hypercomplex algorithms as described elsewhere in this document.
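The combination of an original and an enhanced gray-level image may be sketched as follows. This is a minimal illustration, assuming a logarithmic enhancement and assuming that the original value occupies the real component and the enhanced value the first imaginary component of a quaternion; the function names `log_enhance` and `to_quaternion_image` and the component assignment are hypothetical, and any other enhancement or fusion technique could be substituted.

```python
import math

def log_enhance(image, max_level=255.0):
    """Logarithmic enhancement, scaled so that max_level maps to max_level."""
    scale = max_level / math.log1p(max_level)
    return [[scale * math.log1p(v) for v in row] for row in image]

def to_quaternion_image(original, enhanced):
    """Pair each pixel with its enhanced value as a quaternion (a, b, c, d).

    Assumed layout: a = original gray level, b = enhanced gray level,
    c and d unused (zero). Further enhanced versions could fill c and d.
    """
    return [
        [(o, e, 0.0, 0.0) for o, e in zip(orig_row, enh_row)]
        for orig_row, enh_row in zip(original, enhanced)
    ]

gray = [[0.0, 64.0], [128.0, 255.0]]
quat = to_quaternion_image(gray, log_enhance(gray))
```

The resulting image of quaternions can then be passed to any algorithm that accepts hypercomplex input.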

4.17 Hypercomplex Processing of Enhanced Color Images

Standard color image enhancement and analysis techniques, such as, for example, luminance computation, averaging filters, morphological operations, and so on, may be employed in conjunction with the hypercomplex graph components/neural networks described in this application. For example, FIG. 52 shows a system where a color image is split into its components and is then processed by various enhancement algorithms. Next, the original and enhanced images are combined to form a hypercomplex representation of the image data. This representation is then used as input to a hypercomplex neural network to produce some output prediction, classification, or enhancement of the original color image.

4.18 Multimodal Biometric Identity Matching

Biometric authentication has become popular in recent years due to advancements in sensors and imaging algorithms. However, any single biometric credential can still easily be corrupted due to noise. For example, camera images may be occluded or have unwanted shadows; fingerprint readers may read fingers covered in dirt or mud; iris images may be corrupted due to contact lenses; and so on. To increase the accuracy of biometric authentication systems, multimodal biometric measurements are helpful.

An exemplary application of the hypercomplex neural networks in this patent application is multimodal biometric identity matching, where multiple biometric sensors are combined at the feature level to enable decision making based on a fused set of data. FIG. 53 shows an exemplary system that employs facial photographs and fingerprints. The face and fingerprint data is digitized and sent to a hypercomplex deep neural network, which learns an appropriate hierarchical data representation to combine the two sensors. Faces and fingerprints are also stored in a database, and the in-database data is processed using a copy of the deep network so that the matching module is matching equivalent data. After matching, another hypercomplex neural network may be used for classification and, finally, a decision module can accept or reject that the input data belongs to a person of a given identity. The face in FIG. 53 is drawn in grayscale but may also be presented in color; there are numerous examples of color image processing throughout this patent application.

Advantages of unified multimodal processing include higher accuracy of classification and better system immunity to noise and spoofing attacks.

4.19 Multimodal Biometric Identity Matching with Autoencoder

To further extend the example in Section 4.18, one may add an additional hypercomplex autoencoder, as pictured in FIG. 54. The (optionally stacked) autoencoder serves to create unsupervised representations of the training dataset, thereby reducing the feature space for the deep hypercomplex network. During training, multiple copies of the input data are each corrupted by independent sets of noise and trained into the autoencoder. This serves to improve generalization performance and the robustness of the feature space learned by the autoencoder. Note that any other feature space reduction method may be employed, and that the autoencoder is merely an example of a feature reduction technique.
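The construction of the noisy training copies may be sketched as follows. This minimal example assumes additive Gaussian noise and real-valued feature vectors; the noise model, the parameter names `copies` and `sigma`, and the function name `noisy_training_pairs` are illustrative assumptions, and the autoencoder itself is not shown.

```python
import random

def noisy_training_pairs(samples, copies, sigma, seed=0):
    """Build (noisy input, clean target) pairs for autoencoder training.

    Each clean sample yields `copies` inputs, each corrupted by an
    independent draw of Gaussian noise with standard deviation `sigma`.
    The reconstruction target is always the uncorrupted sample.
    """
    rng = random.Random(seed)
    pairs = []
    for clean in samples:
        for _ in range(copies):
            noisy = [v + rng.gauss(0.0, sigma) for v in clean]
            pairs.append((noisy, list(clean)))
    return pairs

features = [[0.2, 0.4, 0.6], [0.9, 0.1, 0.5]]
pairs = noisy_training_pairs(features, copies=3, sigma=0.05)
```

Training the autoencoder to map each noisy input back to its clean target is what encourages a representation that is robust to corruption of the input.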

4.20 Multimodal Biometric Identity Matching with Unlabeled Data

In biometrics, it is frequently the case that large quantities of unlabeled data are available, but only a small dataset of labeled data can be obtained. It is desirable to use the unlabeled data to enhance system performance through pre-training steps, as shown in FIG. 55. As in Section 4.19, an autoencoder is trained with noisy copies of data to create a lower-dimensional representation of the input data. However, in the system of FIG. 55, additional unlabeled data is trained into the autoencoder.

Particularly in the case of facial images, most labeled databases have well-lit images in a controlled environment, while unlabeled datasets (e.g., from social media) have significant variations in pose, illumination, and expression. Therefore, creating data representations that are capable of representing both types of data will enhance overall system performance.

4.21 Multimodal Biometric Identity Matching with Transfer Learning

A potential problem with the system described in Section 4.20 is that the feature representation created by the (stacked) autoencoder may result in a feature space where each modality is represented by separate elements; while all modalities theoretically share the same feature representations, they do not share the same numerical elements within those representations. An exemplary solution to this is shown in FIG. 56, where all modalities are forced to share elements in the feature space.

In this proposed system, matching sets of data are formed from, for example: (i) anatomical characteristics, for example, fingerprints, signature, face, DNA, finger shape, hand geometry, vascular technology, iris, and retina; (ii) behavioral characteristics, for example, typing rhythm, gait, gestures, and voice; (iii) demographic indicators, for example, age, height, race, and gender; and (iv) artificial characteristics such as tattoos and other body decoration. While training the autoencoder, one or more modalities from each matching dataset are omitted and replaced with zeros. However, the autoencoder/decoder pair is still trained to reconstruct the missing modality, thereby ensuring that information from the other modalities is used to represent the missing modality. As with FIG. 54 and FIG. 55, noise is repeatedly sampled and added to the autoencoder training set in order to enhance the generality of the learned representation.
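The omission of modalities during training may be sketched as follows. This is a minimal illustration, assuming that each matching set is held as a mapping from a modality name to a real-valued feature vector; the data layout, the modality names, and the function name `mask_modalities` are hypothetical, and the noise sampling and the autoencoder are omitted.

```python
def mask_modalities(sample, omitted):
    """Return an (input, target) pair for one matching set of modalities.

    sample:  dict mapping a modality name to its feature vector.
    omitted: names of modalities to replace with zeros in the input.
    The target keeps every modality, so the autoencoder/decoder pair
    must reconstruct the omitted ones from the remaining modalities.
    """
    masked = {
        name: [0.0] * len(feats) if name in omitted else list(feats)
        for name, feats in sample.items()
    }
    target = {name: list(feats) for name, feats in sample.items()}
    return masked, target

matching_set = {
    "face": [0.3, 0.7, 0.1],
    "fingerprint": [0.8, 0.2],
    "gait": [0.5, 0.5, 0.4, 0.9],
}
net_input, net_target = mask_modalities(matching_set, omitted={"fingerprint"})
```

Cycling the omitted set across training examples exposes the network to every pattern of missing modalities that it may encounter later.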

4.22 Clothing Identification

The fashion industry presents a number of interesting exemplary applications of hypercomplex neural networks. In particular, most labeled data for clothing comes from retailers, who hire models to demonstrate clothing appearance and label the photographs with attributes of the clothing, such as sleeve length, slim or loose, color, and so on. However, many practical applications of clothing detection are relevant for so-called “in the street” clothes. For example, if a retailer wants to know how often a piece of clothing is worn, scanning social media could be an effective approach, provided that the clothing can be identified from photos that are taken in an uncontrolled environment.

Moreover, in biometric applications, clothing identification may be helpful as an additional modality.

An exemplary system for creating a store-clothing-to-attribute classification system is shown in FIG. 57. This system is useful for labeling attributes of in-store clothing that does not have attributes already labeled by a human. In FIG. 58, an (optionally stacked) hypercomplex autoencoder is used to create a forced shared representation of in-street and in-store clothes. As with the biometric application in Section 4.21, the autoencoder is trained by selectively disabling one of the input modalities (in this case, clothing photo) but forcing the entire system to reproduce the missing modality. During clothing identification, the system always runs with the “in-store” input modality missing and, by virtue of the training in the earlier step, produces the correct “in-store” output given the “in-street” input. This output is then matched against a database to obtain the clothing identification, product number, stored attributes, etc.

5. Flowchart Diagram for Hypercomplex Training of Neural Network

FIG. 59 is a flowchart diagram illustrating an exemplary method for training one or more neural network layers, according to some embodiments. According to various embodiments, the neural network layers may be part of a convolutional neural network (CNN) or a tensor neural network (TNN). Aspects of the method of FIG. 59 may be implemented by a processing element or a processor coupled to a memory medium and included within any of a variety of types of computer devices, such as those illustrated in and described with respect to various of the Figures herein, or more generally in conjunction with any of the computer systems or devices shown in the above Figures, among other devices, as desired. For example, the processing element may be a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), and/or any other type of processing element. In various embodiments, some of the elements of the methods shown may be performed concurrently, in a different order than shown, may be substituted for by other method elements, or may be omitted. Additional method elements may also be performed as desired. As shown, the method of FIG. 59 may operate as follows.

At 502, a hypercomplex representation of input training data may be created. In some embodiments, the input training data may comprise image data (e.g., traditional image data, multispectral image data, or hyperspectral image data), and the hypercomplex representation may comprise a first tensor. For example, in some embodiments, each spectral component of the multispectral image data may be associated in the hypercomplex representation with a separate dimension in hypercomplex space. In these embodiments, a first dimension of the first tensor may correspond to a first dimension of pixels in an input image, a second dimension of the first tensor may correspond to a second dimension of pixels in the input image, and a third dimension of the first tensor may correspond to different spectral components of the input image in a multispectral or hyperspectral representation. In other words, the first tensor may separately represent each of a plurality of spectral components of the input images as respective matrices of pixels. In some embodiments, the first tensor may have an additional dimension corresponding to depth (e.g., for a three-dimensional “image”, i.e., a multispectral 3D model).
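The tensor layout described at 502 may be sketched as follows, with rows and columns of pixels as the first two dimensions and spectral components as the third. This minimal example uses nested lists and a hypothetical helper named `stack_bands`; the storage format and the requirement that all bands share one size are assumptions made for illustration.

```python
def stack_bands(bands):
    """Stack per-wavelength images into a tensor indexed [row][col][band].

    bands: list of images, one per spectral component, each given as a
           list of rows of pixel values. All bands must share one size.
    """
    height, width = len(bands[0]), len(bands[0][0])
    for band in bands:
        if len(band) != height or any(len(row) != width for row in band):
            raise ValueError("all spectral bands must share the same size")
    return [
        [[band[r][c] for band in bands] for c in range(width)]
        for r in range(height)
    ]

# Four 2 x 3 bands; with four components, each pixel's band vector could
# be read as the four components of one quaternion.
bands = [
    [[10 * b + 3 * r + c for c in range(3)] for r in range(2)]
    for b in range(4)
]
tensor = stack_bands(bands)
```

For example, `tensor[1][2]` holds the values of the pixel at row 1, column 2 across all four wavelengths.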

In other embodiments, the input training data may comprise text and the CNN may be designed for speech recognition or another type of text-processing application. In these embodiments, the hypercomplex representation of input training data may comprise a set of data tensors. In these embodiments, a first dimension of each data tensor may correspond to words in an input text, a second dimension of each data tensor may correspond to different parts of speech of the input text in a hypercomplex representation, and each data tensor may have additional dimensions corresponding to one or more of data depth, text depth, and sentence structure.

At 504, hypercomplex convolution of the hypercomplex representation (e.g., the first tensor in some embodiments, or the set of two or more data tensors, in other embodiments) with a second tensor may be performed to produce a third output tensor. For image processing applications, the second tensor may be a hypercomplex representation of weights or adaptive elements that relate one or more distinct subsets of the pixels in the input image with each pixel in the output tensor. In some embodiments, the subsets of pixels are selected to comprise a local window in the spatial dimensions of the input image. In some embodiments, a subset of the weights maps the input data to a hypercomplex output using a hypercomplex convolution function.
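A hypercomplex convolution over local windows may be sketched as follows for the quaternion case. This is a minimal illustration under several assumptions: pixels and weights are quaternions stored as 4-tuples, the weight multiplies the pixel from the left, the window slides without padding, and the kernel is not flipped (the sliding-window form commonly used in CNNs). The function names `hamilton` and `quaternion_conv2d` are hypothetical.

```python
def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def quaternion_conv2d(image, kernel):
    """Slide a quaternion kernel over a quaternion image (no padding).

    Each output pixel is the sum, over the local window, of
    kernel[i][j] * image[r + i][c + j] using the Hamilton product.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            acc = (0.0, 0.0, 0.0, 0.0)
            for i in range(kh):
                for j in range(kw):
                    term = hamilton(kernel[i][j], image[r + i][c + j])
                    acc = tuple(x + y for x, y in zip(acc, term))
            out_row.append(acc)
        output.append(out_row)
    return output

# A 3 x 3 image of pure quaternions and a 2 x 2 kernel of real ones,
# which simply sums each local window.
image = [[(0.0, float(r), float(c), 1.0) for c in range(3)] for r in range(3)]
ones = [[(1.0, 0.0, 0.0, 0.0)] * 2 for _ in range(2)]
summed = quaternion_conv2d(image, ones)
```

Because the Hamilton product is not commutative, multiplying the weight from the right instead of the left generally gives a different layer; the choice made here is arbitrary.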

In text processing applications, 504 may comprise performing hypercomplex multiplication using a first data tensor of the set of two or more data tensors of the hypercomplex representation with a third hypercomplex tensor to produce a hypercomplex intermediate tensor result, and then multiplying the hypercomplex intermediate result with a second data tensor of the set of two or more data tensors to produce a fourth hypercomplex output tensor, wherein the third hypercomplex tensor is a hypercomplex representation of weights or adaptive elements that relate the hypercomplex data tensors to one another. In some embodiments, the fourth hypercomplex output tensor may be optionally processed through additional transformations and then serve as input data for a subsequent layer of the neural network. In some embodiments, the third output tensor may serve as input data for a subsequent layer of the neural network.

At 506, the weights or adaptive elements in the second tensor may be adjusted such that an error function related to the input training data and its hypercomplex representation is reduced. For example, a steepest descent or other minimization calculation may be performed on the weights or adaptive elements in the second tensor such that the error function is reduced.
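A steepest descent adjustment may be sketched as follows for a single quaternion weight. This minimal example assumes a squared-error function E(w) = |w x - t|^2 for one input x and one desired response t, and treats the four components of w as real parameters; under that assumption the gradient with respect to w is 2 (w x - t) x*, where x* is the quaternion conjugate. The error function, the learning rate, and the function names are illustrative assumptions, not the specific procedure of FIG. 59.

```python
def hamilton(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def conjugate(q):
    a, b, c, d = q
    return (a, -b, -c, -d)

def residual(w, x, t):
    """Error quaternion e = w * x - t."""
    return tuple(p - q for p, q in zip(hamilton(w, x), t))

def loss(w, x, t):
    """Squared error |w * x - t|^2."""
    return sum(v * v for v in residual(w, x, t))

def descent_step(w, x, t, lr):
    """One steepest descent update: w <- w - lr * 2 * e * conj(x)."""
    grad = tuple(2.0 * g for g in hamilton(residual(w, x, t), conjugate(x)))
    return tuple(wi - lr * gi for wi, gi in zip(w, grad))

x = (1.0, 2.0, 0.0, -1.0)            # training input
w_true = (0.5, -1.0, 0.25, 2.0)      # weight used only to generate the target
t = hamilton(w_true, x)              # desired response
w = (0.0, 0.0, 0.0, 0.0)             # initial weight
losses = []
for _ in range(50):
    losses.append(loss(w, x, t))
    w = descent_step(w, x, t, lr=0.05)
```

In a full network the same kind of update would be applied to every weight in the second tensor, with gradients obtained by backpropagation through the layers.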

At 508, the adjusted weights may be stored in the memory medium to obtain a trained neural network. Each of steps 502-508 may be subsequently iterated on subsequent respective input data to iteratively train the neural network.

Further modifications and alternative embodiments of various aspects of the invention will be apparent to those skilled in the art in view of this description. Accordingly, this description is to be construed as illustrative only and is for the purpose of teaching those skilled in the art the general manner of carrying out the invention. It is to be understood that the forms of the invention shown and described herein are to be taken as examples of embodiments. Elements and materials may be substituted for those illustrated and described herein, parts and processes may be reversed, and certain features of the invention may be utilized independently, all as would be apparent to one skilled in the art after having the benefit of this description of the invention. Changes may be made in the elements described herein without departing from the spirit and scope of the invention as described in the following claims.

What is claimed is:
 1. A method for training one or more convolutional neural network (CNN) layers, the method comprising: by a computer processor having a memory coupled thereto: creating a first layer of the CNN comprising a hypercomplex representation of input training data, wherein the hypercomplex representation of the input training data comprises a first tensor, wherein a first dimension of the first tensor corresponds to a first dimension of pixels in an input image, wherein a second dimension of the first tensor corresponds to a second dimension of pixels in the input image, wherein the first dimension is orthogonal to the second dimension, and wherein a third dimension of the first tensor corresponds to different spectral components of the input image in a multispectral representation; and performing hypercomplex convolution of the first tensor with a second tensor to produce a third output tensor, wherein the second tensor is a hypercomplex representation of first weights that relate one or more distinct subsets of the pixels in the input image with each pixel in the output tensor, and wherein the third output tensor serves as input training data for a second layer of the CNN; adjusting the first weights in the second tensor such that an error function related to the input training data and its hypercomplex representation is reduced; and storing the adjusted first weights in the memory as trained parameters of the first layer of the CNN.
 2. The method of claim 1, wherein the distinct subsets of pixels are selected to comprise local windows in the spatial dimensions of the input image.
 3. The method of claim 1, wherein the input image comprises a three-dimensional image, wherein a third dimension of the first tensor corresponds to a depth of pixels in the input image, and wherein the depth is orthogonal to the first dimension and the second dimension.
 4. The method of claim 1, wherein adjusting the first weights in the second tensor such that the error function related to the input training data and its hypercomplex representation is reduced comprises performing a steepest descent calculation on the first weights in the second tensor.
 5. The method of claim 1, the method further comprising: creating a second layer of the CNN comprising a hypercomplex representation of the third output tensor, wherein the hypercomplex representation of the third output tensor comprises a fourth tensor, performing hypercomplex convolution of the fourth tensor with a fifth tensor to produce a sixth output tensor, wherein the fifth tensor is a hypercomplex representation of second weights; adjusting the second weights in the fifth tensor such that an error function related to the third output tensor and its hypercomplex representation is reduced; and storing the adjusted second weights in the memory as trained parameters of the second layer of the CNN.
 6. The method of claim 1, the method further comprising: iteratively repeating said creating, performing, adjusting and storing on one or more subsequent layers of the CNN.
 7. The method of claim 1, wherein performing hypercomplex convolution of the first tensor with the second tensor comprises performing quaternion multiplication of the first tensor with the second tensor.